Removing text with certain pattern in cell


I want to remove parts of the text in a cell that match a pattern such as [cid:image003.gif#01D863CC.CAE51sd0] or [https://xxxx=0].
Such a part may appear several times in a cell, at random positions.
I have read some material (Remove Text Within Cell Starting with String and Ending with Character), but I have no clue how to write the code so that it loops through a cell line by line.
I prepared two samples.
Sample A:
Hi xxx,
This is Ken
[cid:image003.gif#01D863CC.CAE51sd0]
[https://xxxx=0]
[cid:imagedddd0]
Expected:
Hi xxx,
This is Ken
Sample B:
[cid:image003.gif#01D863CC.CAE51sd0]
[https://xxxx=0]
Hi xxx,
This is Ken
[cid:imagedddd0]
Expected:
Hi xxx,
This is Ken

If I understand you correctly, you want to process every cell that contains the characters "[" and "]", and in each such cell remove the "[", the "]" and everything between them.
Example data in the active sheet:
The cells with that kind of data (highlighted in yellow) can be scattered anywhere in the active sheet.
If your data is similar to that example and the expected result is the same text with the bracketed parts removed,
then the sub is something like this:
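Sub test()
    Dim c As Range
    Dim pos1 As Long: Dim pos2 As Long
    Do
        ' Loop-A: find a cell whose value still contains "["
        Set c = ActiveSheet.UsedRange.Find("[", LookIn:=xlValues, LookAt:=xlPart)
        If c Is Nothing Then Exit Do
        Do
            ' Loop-B: remove one "[...]" part per pass
            pos1 = InStr(c.Value, "["): If pos1 = 0 Then Exit Do
            ' Search for "]" only after the "[" that was found
            pos2 = InStr(pos1, c.Value, "]")
            ' An unmatched "[" would be found again forever, so stop the sub here
            If pos2 = 0 Then Exit Sub
            c.Value = Left$(c.Value, pos1 - 1) & Mid$(c.Value, pos2 + 1)
        Loop
    Loop
End Sub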
There are two loops in the sub.
Loop-A finds any cell in the active sheet that contains the "[" character and assigns it to the c variable.
Loop-A stops when no cell containing "[" is found.
Loop-B processes the found cell for as long as it contains "[".
Loop-B stops when there is no more "[" in the found cell.
Inside loop-B, the sub stores the position of "[" in the pos1 variable and the position of "]" in the pos2 variable. It then replaces the "[", the "]"
and whatever text lies between them in the found cell (the c variable) with nothing ("").
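The bracket-stripping step of loop-B can also be written as a plain string function, which makes it usable as a worksheet formula or on values held in an array. This is only a sketch: the name StripBracketed is hypothetical (it does not come from the answer above), and it assumes that an unmatched "[" should simply be left in place.

```vba
Option Explicit

' Hypothetical helper: removes every "[...]" part from a string.
' Assumption: an opening "[" without a later "]" is left untouched.
Function StripBracketed(ByVal s As String) As String
    Dim pos1 As Long, pos2 As Long
    Do
        pos1 = InStr(s, "[")
        If pos1 = 0 Then Exit Do          ' no more opening brackets
        pos2 = InStr(pos1, s, "]")        ' closing bracket after the opening one
        If pos2 = 0 Then Exit Do          ' unmatched "[": keep the rest as it is
        s = Left$(s, pos1 - 1) & Mid$(s, pos2 + 1)
    Loop
    StripBracketed = s
End Function
```

In a sheet it could be called as =StripBracketed(A1). Note that only the bracketed text is removed; spaces or line breaks around it stay, so StripBracketed("Hi [cid:x] Ken") returns "Hi  Ken" with two spaces.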
After seeing the sample data, I think it's better to do this in MS Word, so I searched the internet for how to use VBA in Word. I'm not exactly sure the syntax is correct, but the code below (in a Word VBA module) seems to work as expected.
Sub test()
    Dim pos1 As Long: Dim pos2 As Long
    Dim txt As String: Dim slice As String: Dim rpl As String
    Do
        pos1 = InStr(ActiveDocument.Content.Text, "[")
        If pos1 = 0 Then Exit Do
        ' Look for the closing bracket only after the opening one
        pos2 = InStr(pos1, ActiveDocument.Content.Text, "]")
        If pos2 = 0 Then Exit Do ' unmatched "[": stop instead of failing in Mid
        txt = Mid(ActiveDocument.Content.Text, pos1, pos2 - pos1 + 1)
        If Len(txt) > 250 Then
            ' Too long for one Find: cut the first 250 chars, leave "[" behind
            slice = Left(txt, 250): rpl = "["
        Else
            slice = txt: rpl = ""
        End If
        With ActiveDocument.Content.Find
            .Execute FindText:=slice, ReplaceWith:=rpl, _
                Format:=True, Replace:=wdReplaceAll
        End With
    Loop
End Sub
The process is similar to the one in the Excel sub. The difference is that this sub checks whether the text to be removed (the txt variable) is longer than 250 characters. If it is, the first 250 characters go into the slice variable and "[" becomes the replacement (the rpl variable), so the remainder again starts with "[" and is picked up on the next pass.
If txt is not longer than 250 characters, slice is the same as txt and rpl is simply the empty string ("").
WARNING:
On my computer, the sub takes almost 2 minutes to finish the job in MS Word with the data from column A of your Excel sample sheet.

Related

vba remove comma without removing strikethrough

How can I remove the commas without removing the strikethrough format?
Example: C418, C419, C420 , C421, C422, C423, C424
Expected Result: C418 C419 C420 C421 C422 C423 C424
Final Result: C418, C419 C420 C421 C422 C423 C424
I check whether the cell contains strikethrough text, and the function below detects it correctly. But once I remove the commas with the Replace function (replacing each comma with an empty string), the strikethrough format is removed as well, so the function no longer works and the outcome is different.
I would like to use a space delimiter, to match the other cells, so that I can split the cell value afterwards.
If HasStrikethrough(BOMCk.Sheets("Filtered RO BOM").Range("B" & LCB)) = True Then
BOMCk.Sheets("Filtered RO BOM").Range("B" & LCB).Value = Replace(BOMCk.Sheets("Filtered RO BOM").Range("B" & LCB).Value, ",", "")
BOMCk.Sheets("Filtered RO BOM").Range("G" & LCB).Value = "strike-off"
ElseIf HasStrikethrough(BOMCk.Sheets("Filtered RO BOM").Range("B" & LCB)) = False Then
BOMCk.Sheets("Filtered RO BOM").Range("B" & LCB).Value = Replace(BOMCk.Sheets("Filtered RO BOM").Range("B" & LCB).Value, ",", "")
End If
Function HasStrikethrough(rng As Range) As Boolean
Dim i As Long
With rng(1)
For i = 1 To .Characters.Count
If .Characters(i, 1).Font.StrikeThrough Then
HasStrikethrough = True
Exit For
End If
Next i
End With
End Function

Range.Characters only works if the cell's value is 255 characters or fewer.
Range.Characters(i, 1).Delete will delete the commas. Make sure to iterate from the last position to the first when deleting.
Sub RemoveCommas(ByVal Target As Range)
    If Target.Characters.Count > 255 Then
        MsgBox "Range.Characters only works with String with 255 or less Characters", vbCritical, "String too long"
        Exit Sub
    End If
    Dim n As Long
    For n = Target.Characters.Count To 1 Step -1
        If Target.Characters(n, 1).Text = "," Then Target.Characters(n, 1).Delete
    Next
End Sub

Alternative via xlRangeValueXMLSpreadsheet Value
The .Value(11) approach solves the question with a very simple string replacement (though handling the XML string can become very complicated in many other cases):
Sub RemoveCommata(rng As Range, Optional colOffset As Long = 1)
'a) Get range data as xml spreadsheet value
Dim xmls As String: xmls = rng.Value(xlRangeValueXMLSpreadsheet) ' //alternatively: xmls = rng.Value(11)
'b) find start position of body
Dim pos As Long: pos = InStr(xmls, "<Worksheet ")
'c) define xml spreadsheet parts and remove commata in body
Dim head As String: head = Left(xmls, pos - 1)
Dim body As String: body = Replace(Mid(xmls, pos), ",", "")
'd) write cleaned range back
rng.Offset(0, colOffset).Value(11) = head & body
End Sub
Help reference links
Excel XlRangeValueDataType enumeration
Excel Range Value

Swapping delimited strings in an excel column

I have a column in a very large excel spreadsheet that is in some cases incorrectly formatted. It should contain first a street address, then a name, separated by a hyphen, as shown:
123 Main St-Smith
However, some are formatted in reverse, such as:
Jones-231 High St
All the addresses start with a numeric and all the names start with an alpha. I am looking for a macro or code that would swap only the name and address where it is incorrectly formatted. I have tried turning it into a comma delimited to separate them out, but since they only occur intermittently I am still left with fixing them one by one manually.
Any suggestions? I am by no means an Excel macro expert. Thanks!

Split the string on the hyphen, then look for a space in the second element.
Sub SwapNameAndAddress()
    Dim i As Long, tmp As Variant
    With Worksheets("Sheet1")
        For i = 2 To .Cells(.Rows.Count, "A").End(xlUp).Row
            tmp = Split(.Cells(i, "A").Value2, "-")
            ' Swap only when there are exactly two parts and the second contains a space
            If UBound(tmp) = 1 Then
                If InStr(1, tmp(1), " ") > 0 Then
                    .Cells(i, "A").Value2 = Join(Array(tmp(1), tmp(0)), "-")
                End If
            End If
        Next i
    End With
End Sub

As you wrote
Street name is any string that begins with a digit and ends with either a hyphen or the end of the string
Name is any string that starts with a non-digit and ends with either a hyphen or the end of the string
This can be done using just native VBA, (although at first I was going to use Regular Expressions)
split on the hyphen
rearrange depending on if first starts with a number or not
do some error checking in case no hyphen present or don't have the number and non-number start as specified.
Option Explicit
Function fmtAddressName2(S As String) As String
Dim sAddr As String, sName As String
Dim v As Variant
v = Split(S, "-")
On Error GoTo badFormat
If IsNumeric(Left(v(0), 1)) And Not IsNumeric(Left(v(1), 1)) Then
sAddr = v(0)
sName = v(1)
ElseIf Not IsNumeric(Left(v(0), 1)) And IsNumeric(Left(v(1), 1)) Then
sAddr = v(1)
sName = v(0)
Else
GoTo badFormat
End If
fmtAddressName2 = sAddr & "-" & sName
Exit Function
badFormat:
'return unchanged string
fmtAddressName2 = S
'or could return an error message
End Function

Remove words that contain each other and leave the longer one

I'm looking for a macro (preferably a function) that would take cell contents, split it into separate words, compare them to one another and remove the shorter words.
Here's an image of what I want the output to look like (I need the words that are crossed out removed):
I tried to write a macro myself, but it doesn't work 100% properly because it's not taking the last words and sometimes removes what shouldn't be removed. Also, I have to do this on around 50k cells, so a macro takes a lot of time to run, that's why I'd prefer it to be a function. I guess I shouldn't use the replace function, but I couldn't make anything else work.
Sub clean_words_containing_eachother()
Dim sht1 As Worksheet
Dim LastRow As Long
Dim Cell As Range
Dim cell_value As String
Dim word, word2 As Variant
Set sht1 = ActiveSheet
col = InputBox("Which column do you want to clear?")
LastRow = sht1.Cells(sht1.Rows.Count, col).End(xlUp).Row
Let to_clean = col & "2:" & col & LastRow
For i = 2 To LastRow
For Each Cell In sht1.Range(to_clean)
cell_value = Cell.Value
cell_split = Split(cell_value, " ")
For Each word In cell_split
For Each word2 In cell_split
If word <> word2 Then
If InStr(word2, word) > 0 Then
If Len(word) < Len(word2) Then
word = word & " "
Cell = Replace(Cell, word, " ")
ElseIf Len(word) > Len(word2) Then
word2 = word2 & " "
Cell = Replace(Cell, word2, " ")
End If
End If
End If
Next word2
Next word
Next Cell
Next i
End Sub

Assuming that the retention of the third word in your first example is an error, since books is contained later on in notebooks:
5003886 book books bound case casebound not notebook notebooks office oxford sign signature
and also assuming that you would want to remove duplicate identical words, even if they are not contained subsequently in another word, then we can use a Regular Expression.
The regex will:
Capture each word
look-ahead to see if that word exists later on in the string
if it does, remove it
Since VBA regexes cannot also look-behind, we work-around this limitation by running the regex a second time on the reversed string.
Then remove the extra spaces and we are done.
Option Explicit
Function cleanWords(S As String) As String
Dim RE As Object, MC As Object, M As Object
Dim sTemp As String
Set RE = CreateObject("vbscript.regexp")
With RE
.Global = True
.Pattern = "\b(\w+)\b(?=.*\1)"
.ignorecase = True
'replace looking forward
sTemp = .Replace(S, "")
' check in reverse
sTemp = .Replace(StrReverse(sTemp), "")
'return to normal
sTemp = StrReverse(sTemp)
'Remove extraneous spaces
cleanWords = WorksheetFunction.Trim(sTemp)
End With
End Function
Limitations
punctuation will not be removed
a "word" is defined as containing only the characters in the class [_A-Za-z0-9] (letters, digits and the underscore).
if any words might be hyphenated, or contain other non-word characters
in the above, they will be treated as two separate words
if you want it treated as a single word, then we might need to change the regex

General steps:
Write cell to array (already working)
For each element (x), go through each element (y) (already working)
If x is in y AND y is longer than x THEN set x to ""
Concat array back into string
Write string to cell
String/array manipulations are much faster than operations on cells, so this will give you some increase in performance (depending on the number of words you need to replace in each cell).
The "last word problem" might be that you don't have a space after the last word within your cells, since you only replace word + " " with " ".
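The general steps above could be sketched as a worksheet function like the one below. The name RemoveContainedWords is hypothetical, and two choices are assumptions rather than part of the steps: the comparison ignores case, and identical duplicate words are kept, because a word is only dropped when a strictly longer word contains it.

```vba
Option Explicit

' Hypothetical helper: drops every word that is contained in a longer word
' of the same string. Assumptions: words are separated by single spaces,
' the comparison ignores case, identical duplicates are kept.
Function RemoveContainedWords(ByVal s As String) As String
    Dim words() As String
    Dim i As Long, j As Long
    Dim result As String

    words = Split(s, " ")                      ' write the cell text to an array
    For i = LBound(words) To UBound(words)     ' x
        For j = LBound(words) To UBound(words) ' y
            If Len(words(j)) > Len(words(i)) Then
                If InStr(1, words(j), words(i), vbTextCompare) > 0 Then
                    words(i) = ""              ' x is inside the longer y: drop x
                    Exit For
                End If
            End If
        Next j
    Next i

    ' Concatenate the remaining words back into one string
    For i = LBound(words) To UBound(words)
        If Len(words(i)) > 0 Then result = result & " " & words(i)
    Next i
    RemoveContainedWords = Mid$(result, 2)     ' strip the leading space
End Function
```

Because nothing is replaced inside the cell text itself, the last word is handled like any other. With the sample string quoted in the regex answer above ("5003886 book books bound case casebound not notebook notebooks office oxford sign signature"), this sketch returns "5003886 casebound notebooks office oxford signature".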

vba search replace character

I'm trying to prepare a spreadsheet for a report in Excel VBA. Unfortunately there are some weird characters here that need to be replaced. Easy enough, except for this character:
¦
I can't seem to paste that character into a string in a Replace call in the editor. When I try, the output is _. I then thought to refer to it by its Chr code. A quick lookup said it was Chr(166). http://www.gtwiki.org/mwiki/?title=VB_Chr_Values
Replace(s, "â€" + Chr(166), "...")
But this is not that character at all (at least on Mac excel). I tried:
For i = 1 To 255
Debug.Print Chr(i)
Next i
And I didn't see this character anywhere. Does anyone know how I can reference this character in vba code in order to replace it?

Not sure if regexp is available for VBA on Mac, but you could simplify your existing code greatly as below.
It uses a sample string, strIn:
Dim strIn As String
strIn = "1â€1â€x123"
Do While InStr(strIn, "â€") > 0
Mid$(strIn, InStr(strIn, "â€"), 3) = "..."
Loop

Click on a cell containing your miscreant character and run this small macro:
Sub WhatIsIt()
    Dim s As String, message As String
    Dim ch As String, cd As Long
    Dim i As Long, L As Long
    s = ActiveCell.Text
    L = Len(s)
    ' Build one line per character: the character and its code
    For i = 1 To L
        ch = Mid(s, i, 1)
        cd = Asc(ch)
        message = message & ch & " " & cd & vbCrLf
    Next i
    MsgBox message
End Sub
It should reveal the characters in the cell and their codes.
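A variant of the same idea that reports Unicode code points may help here, since Asc and Chr depend on the platform's code page. The sequence â, €, ¦ is what the UTF-8 bytes of the ellipsis character (U+2026) look like when they are read as Windows-1252, so the sketch below assumes that is what happened to this data. The names CodePoints and FixEllipsis are hypothetical, and whether ChrW behaves the same way in Excel for Mac is an assumption that has not been checked.

```vba
Option Explicit

' Hypothetical helper: lists the Unicode code point of every character,
' e.g. "U+00E2 U+20AC U+00A6".
Function CodePoints(ByVal s As String) As String
    Dim i As Long, code As Long, result As String
    For i = 1 To Len(s)
        ' AscW returns a signed Integer, so mask it into the range 0..65535
        code = AscW(Mid$(s, i, 1)) And &HFFFF&
        result = result & " U+" & Right$("0000" & Hex$(code), 4)
    Next i
    CodePoints = Mid$(result, 2)   ' strip the leading space
End Function

' Hypothetical helper: replaces the three-character debris with "...".
' Assumption: the debris is U+00E2, U+20AC, U+00A6 (a mis-decoded ellipsis).
Function FixEllipsis(ByVal s As String) As String
    Dim broken As String
    broken = ChrW$(&HE2) & ChrW$(&H20AC) & ChrW$(&HA6)
    FixEllipsis = Replace(s, broken, "...")
End Function
```

Building the search string from code points avoids pasting the problem characters into the editor at all.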

It's dirty, but here's the workaround that I used to solve this problem. I knew that my issue character always came after "â€", so the idea was to replace those 2 characters together with the one that follows them. I didn't really know how to replace a character at a position in a string, so my idea was to convert the string to an array of characters and replace the array elements at those specific indexes. Here's what it looks like:
Do While InStr(s, "â€") > 0
num2 = InStr(s, "â€")
arr = stringToArray(s)
arr(num2 - 1) = "<~>"
arr(num2) = "<~>"
arr(num2 + 1) = "<~>"
s = Replace(arrayToString(arr), "<~><~><~>", "...")
Loop
...
Function stringToArray(ByVal my_string As String) As Variant
Dim buff() As String
ReDim buff(Len(my_string) - 1)
For i = 1 To Len(my_string)
buff(i - 1) = Mid$(my_string, i, 1)
Next
stringToArray = buff
End Function
Function arrayToString(ByVal arr As Variant) As String
Dim s As String
For Each j In arr
s = s & j
Next j
arrayToString = s
End Function
In practice, what I replaced those indexes with had to be something unique but recognizable. Then I can replace my unique marker with whatever I want. There are sure to be edge cases, but for now it gets the job done. The stringToArray function was pulled from: Split string into array of characters?

How to extract text within a string of text

I have a simple problem that I'm hoping to resolve without using VBA but if that's the only way it can be solved, so be it.
I have a file with multiple rows (all one column). Each row has data that looks something like this:
1 7.82E-13 >gi|297848936|ref|XP_00| 4-hydroxide gi|297338191|gb|23343|randomrandom
2 5.09E-09 >gi|168010496|ref|xp_00| 2-pyruvate
etc...
What I want is some way to extract the string of numbers that begin with "gi|" and end with a "|". For some rows this might mean as many as 5 gi numbers, for others it'll just be one.
What I would hope the output would look like would be something like:
297848936,297338191
168010496
etc...

Here is a very flexible VBA answer using the regex object. The function extracts every sub-group match it finds (the parts inside the parentheses), separated by whatever string you want (the default is ", "). You can find info on regular expressions here: http://www.regular-expressions.info/
You would call it like this, assuming that first string is in A1:
=RegexExtract(A1,"gi[|](\d+)[|]")
Since this looks for all occurrences of "gi|" followed by a series of digits and then another "|", for the first line in your question this would give you this result:
297848936, 297338191
Just run this down the column and you're all done!
Function RegexExtract(ByVal text As String, _
ByVal extract_what As String, _
Optional separator As String = ", ") As String
Dim allMatches As Object
Dim RE As Object
Set RE = CreateObject("vbscript.regexp")
Dim i As Long, j As Long
Dim result As String
RE.pattern = extract_what
RE.Global = True
Set allMatches = RE.Execute(text)
For i = 0 To allMatches.count - 1
For j = 0 To allMatches.Item(i).submatches.count - 1
result = result & (separator & allMatches.Item(i).submatches.Item(j))
Next
Next
If Len(result) <> 0 Then
result = Right$(result, Len(result) - Len(separator))
End If
RegexExtract = result
End Function

Here it is (assuming data is in column A)
=VALUE(LEFT(RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2),
FIND("|",RIGHT(A1,LEN(A1) - FIND("gi|",A1) - 2)) -1 ))
Not the nicest formula, but it will work to extract the number.
I just noticed that you have two values per row, with the output separated by commas. You will need to check whether there is a second match, a third match, etc. to make it work for multiple numbers per cell.
In reference to your exact sample (assuming 2 values maximum per cell) the following code will work:
=IF(ISNUMBER(FIND("gi|",$A1,FIND("gi|", $A1)+1)),CONCATENATE(LEFT(RIGHT($A1,LEN($A1)
- FIND("gi|",$A1) - 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ),
", ",LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1)
- 2),FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1,FIND("gi|", $A1)+1) - 2))
-1 )),LEFT(RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2),
FIND("|",RIGHT($A1,LEN($A1) - FIND("gi|",$A1) - 2)) -1 ))
How's that for ugly? A VBA solution may be better for you, but I'll leave this here for you.
To go up to 5 numbers, well, study the pattern and recurse manually in the formula. It will get long!

I'd probably split the data first on the | delimiter using the convert text to columns wizard.
In Excel 2007 that is on the Data tab, Data Tools group and then choose Text to Columns. Specify Other: and | as the delimiter.
From the sample data you posted it looks like after you do this the numbers will all be in the same columns so you could then just delete the columns you don't want.

As the others presented solutions without VBA, I'll present one that does use it. Whether to use it is your call.
I just saw that @Issun presented a solution with regex, very nice! Either way, here is a 'modest' solution for the question, using only 'plain' VBA.
Option Explicit
Option Base 0
Sub findGi()
Dim oCell As Excel.Range
Set oCell = Sheets(1).Range("A1")
'Loops through every row until empty cell
While Not oCell.Value = ""
oCell.Offset(0, 1).Value2 = GetGi(oCell.Value)
Set oCell = oCell.Offset(1, 0)
Wend
End Sub
Private Function GetGi(ByVal sValue As String) As String
Dim sResult As String
Dim vArray As Variant
Dim vItem As Variant
Dim iCount As Integer
vArray = Split(sValue, "|")
iCount = 0
'Loops through the array...
For Each vItem In vArray
'Searches for the 'Gi' factor...
If vItem Like "*gi" And UBound(vArray) > iCount + 1 Then
'Concatenates the results...
sResult = sResult & vArray(iCount + 1) & ","
End If
iCount = iCount + 1
Next vItem
'And removes trail comma
If Len(sResult) > 0 Then
sResult = Left(sResult, Len(sResult) - 1)
End If
GetGi = sResult
End Function

Open your Excel file in Google Sheets and use a regular expression with REGEXEXTRACT.
Sample Usage
=REGEXEXTRACT("My favorite number is 241, but my friend's is 17", "\d+")
Tip: REGEXEXTRACT will return 241 in this example because it returns the first matching case.
In your case
=REGEXEXTRACT(A1,"gi[|](\d+)[|]")
