I am required to extract passages of text from the contents of Excel cells in which the originator has essentially done a manual Track Changes using Strikethrough font. The passages are identifiable with certain character patterns, but I have to ignore Strikethrough characters to see them. The Strikethrough characters do not appear in regular locations within each cell, so are essentially randomly dispersed with normal font text.
I have achieved my goal using VBA for Excel, but the solution is extremely (and impracticably) slow. Having searched this site and the wider web for answers, it seems the use of the Characters object is to blame.
So my question is: has anyone found a way of parsing such text that does not involve the Characters object?
The sub I wrote to do the parsing is too long to post here, but following is some test code which uses the Characters object in a similar way. This takes 60 s to parse a cell with 3000 characters in it. At that speed, it would take 50 hours to process the entire spreadsheet I've been given.
Private Sub FindLineBreakChars(TargetCell As Excel.Range)
Dim n As Integer
Dim ch As String
Dim st As Boolean
If TargetCell.Cells.Count <> 1 Then
Call MsgBox("Error: more or less than one cell in range specified.")
Else
If IsEmpty(TargetCell.Value) Then
Call MsgBox("Error: target cell is empty.")
Else
If Len(TargetCell.Value) = 0 Then
Call MsgBox("Error: target cell contains an empty string.")
Else
'Parse the characters in the cell one by one.
For n = 1 To TargetCell.Characters.Count
ch = TargetCell.Characters(n, 1).Text
st = TargetCell.Characters(n, 1).Font.Strikethrough
If ch = vbCr Then
Debug.Print "#" & n & ": Carriage Return (vbCr)" & ", strikethrough = " & st & vbCrLf
ElseIf ch = vbLf Then
Debug.Print "#" & n & ": Line Feed (vbLf)" & ", strikethrough = " & st & vbCrLf
End If
Next n
End If
End If
End If
End Sub
You're right, access to Characters is very slow, so your goal should be to reduce its usage as much as possible.
I don't understand your requirement details, but the following code should give you an idea of how to speed things up. It reads the content of the cell only once, splits the text into separate lines, calculates the position of each line feed character and checks the formatting at that position only. As far as I know there is no way to read the formatting of all characters at once, but this way access to the Characters object is reduced to one call per line:
With TargetCell
Dim lines() As String, lineNo As Integer, textLen As Long
lines = Split(.Value2, vbLf)
textLen = Len(lines(0)) + 1
For lineNo = 1 To UBound(lines)
Dim st
st = .Characters(textLen, 1).Font.Strikethrough
Debug.Print "#" & textLen & ": LineFeed (vbLf) strikethrough = " & st
textLen = textLen + Len(lines(lineNo)) + 1
Next lineNo
End With
To my knowledge, Excel stores Linebreaks in a cell using just the LineFeed character, so the code is checking only that.
This might meet your performance needs: it calls a function which parses the XML representation of the cell content, removes the struck-out sections, and returns the remaining text.
It will be much faster than looping over Characters.
Sub Tester()
Debug.Print NoStrikeThrough(Range("A1"))
End Sub
'Needs a reference to Microsoft XML, v6.0
' in your VBA Project references
Function NoStrikeThrough(c As Range) As String
    Dim doc As New MSXML2.DOMDocument60
    Dim x As MSXML2.IXMLDOMNode, s As MSXML2.IXMLDOMNode
    'need to add some namespaces
    doc.SetProperty "SelectionNamespaces", _
        "xmlns:ss='urn:schemas-microsoft-com:office:spreadsheet' " & _
        "xmlns:ht='http://www.w3.org/TR/REC-html40'"
    doc.LoadXML c.Value(11) 'cell data as XML
    Set x = doc.SelectSingleNode("//ss:Data") '<< cell content
    If x Is Nothing Then Exit Function 'no Data node found (e.g. empty cell)
    Set s = x.SelectSingleNode(".//ht:S") '<< strikethrough
    Do While Not s Is Nothing
        Debug.Print "Struck:", s.Text
        'remove via the parent: <S> may be nested inside other formatting tags
        s.ParentNode.RemoveChild s '<< remove struck section
        Set s = x.SelectSingleNode(".//ht:S")
    Loop
    NoStrikeThrough = doc.Text
End Function
EDIT: here's another way to go at it, by breaking up the text into "blocks" and checking each block to see if it has any strikethrough. How much faster this is than going character-by-character may depend on block size and the distribution of struck-out text in each cell.
Function NoStrikeThrough2(c As Range)
Const BLOCK As Long = 50
Dim L As Long, i As Long, n As Long, pos As Long, x As Long
Dim rv As String, s As String, v
L = Len(c.Value)
n = Application.Ceiling(L / BLOCK, 1) 'how many blocks to check
pos = 1 'block start position
For i = 1 To n
v = c.Characters(pos, BLOCK).Font.Strikethrough
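If IsNull(v) Then
'if strikethrough is "mixed" in this block - parse out
' character-by-character
s = ""
'stop at the last character of this block (and of the text),
' otherwise the first character of the next block is read twice
For x = pos To Application.Min(pos + BLOCK - 1, L)
If Not c.Characters(x, 1).Font.Strikethrough Then
s = s & c.Characters(x, 1).Text
End If
Next x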
rv = rv & s
ElseIf v = False Then
'no strikethrough - take the whole block
rv = rv & c.Characters(pos, BLOCK).Text
End If
pos = pos + BLOCK 'next block position.
Next i
NoStrikeThrough2 = rv
End Function
EDIT2: if you need to make sure all newline characters are not struck out before processing the cell -
Sub ClearParaStrikes(c As Range)
Dim pos As Long
pos = InStr(pos + 1, c.Value, vbLf)
Do While pos > 0
Debug.Print "vbLf at " & pos
c.Characters(pos, 1).Font.Strikethrough = False
pos = InStr(pos + 1, c.Value, vbLf)
Loop
End Sub
The Excel workbook consists of 10,000 rows and 25 columns, and this process takes 15 minutes to complete. I need to reduce the runtime to less than 1 minute. Kindly help me out of this situation.
For Each cl In rng.SpecialCells(2)
For i = Len(cl.Value) To 1 Step -1
If cl.Characters(i, 1).Font.Strikethrough Then
cl.Characters(i, 1).Delete
End If
Next i
Next cl
Very fast approach via xlRangeValueXMLSpreadsheet Value
Using the relatively unknown xlRangeValueXMLSpreadsheet value, also referred to as .Value(11), solves the question by a very simple string replacement (though the XML string handling can turn out to be very complicated under special conditions).
This approach (quickly tested for 10000 rows) seems to be up to 90 times faster than Tim's valid solution, which refines the original code but still takes 14 minutes :-)
Sub RemoveStrThr(rng As Range, Optional colOffset As Long = 1)
'a) Get range data as xml spreadsheet value
Dim xmls As String: xmls = rng.Value(xlRangeValueXMLSpreadsheet) ' //alternatively: xmls = rng.Value(11)
'b) find start position of body
Dim pos As Long: pos = InStr(xmls, "<Worksheet ")
'c) define xml spreadsheet parts and remove <S>-node sections in body
Dim head As String: head = Left(xmls, pos - 1)
Dim body As String: body = Mid(xmls, pos)
'remove strike throughs
Dim results: results = Split(Replace(body, "</S>", "^<S>"), "<S>")
results = Filter(results, "^", False) ' negative filtering of special char "^"
body = Join(results, "")
'd) write cleaned range back
rng.Offset(0, colOffset).Value(11) = head & body
End Sub
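To see what the Split/Filter step in RemoveStrThr does, here is a minimal sketch that applies the same three lines to a plain string instead of a range. StripStruck and DemoStripStruck are hypothetical names used only for illustration.

```vba
' Sketch only: StripStruck is a hypothetical helper, not part of the answer above.
Function StripStruck(ByVal body As String) As String
    Dim parts As Variant
    ' Mark the end of every struck run with "^", then cut at every <S> tag
    parts = Split(Replace(body, "</S>", "^<S>"), "<S>")
    ' Drop every piece that carries the "^" marker (negative filtering)
    parts = Filter(parts, "^", False)
    StripStruck = Join(parts, "")
End Function

Sub DemoStripStruck()
    ' The struck section "gone" is removed, the rest is kept
    Debug.Print StripStruck("keep <S>gone</S>this")
End Sub
```

One consequence of this design: any piece of text that legitimately contains a "^" character is dropped as well, so a different marker would be needed if the data can contain "^".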
Example call
Sub TestRemove()
Application.ScreenUpdating = False
Dim t As Double
t = Timer
RemoveStrThr Sheet1.Range("A2:Z10000"), 27 ' << change to your needs
Debug.Print "done", Format(Timer - t, "0.00 secs")
Application.ScreenUpdating = True
End Sub
Help reference links
Excel XlRangeValueDataType enumeration
Excel Range Value
Addendum (due to @Tim's valuable comment)
Note that if the whole cell content should be struck out then this will not remove the struck-out content: there are no <S> or </S> tags in the element, since the strikethrough is applied via a Style rule (via the xml spreadsheet value head).
To meet this eventuality
"...you could add a second step using something like Application.FindFormat.Font.Strikethrough = True: rng.Replace What:="*", Replacement:="", LookAt:=xlPart, SearchOrder:=xlByRows, SearchFormat:=True: Application.FindFormat.Clear to take care of those cells."
Any use of the Characters collection tends to be kind of slow, so the best you can do (beyond turning off screenupdating) is get some minor improvements by (eg) ignoring cells with no strikethrough, checking for cases where all content is struck through, and batching your calls to Delete.
Sub tester()
Dim t
Range("C1:C3").Copy Range("A1:A999") 'creating some dummy cell values (no/mixed/all ST)
Application.ScreenUpdating = False
t = Timer
RemoveStrikeThrough Range("A1:A999")
Debug.Print "done", Timer - t
End Sub
Sub RemoveStrikeThrough(rng As Range)
Dim cl As Range, hasST, i As Long, pos As Long, st As Boolean
For Each cl In rng.Cells
'only process cells which have any strikethrough style applied
' hasST will be False (no ST), True (all ST) or Null (mixed ST)
hasST = cl.Font.Strikethrough
If TypeName(hasST) = "Boolean" Then
If hasST Then
cl.ClearContents 'all text is struck out, so clear the cell
Else
'Debug.Print "No strikethrough", cl.Address
End If
Else
'mixed - do char by char
For i = Len(cl.Value) To 1 Step -1
If cl.Characters(i, 1).Font.Strikethrough Then
If Not st Then 'new run?
st = True
pos = i
End If
Else
If st Then 'previous run?
cl.Characters(i + 1, pos - i).Delete
st = False
End If
End If
Next i
'remove last strikethrough if any
If st Then cl.Characters(1, pos).Delete
st = False 'reset this
End If
Next cl
End Sub
How can I remove the commas without removing the strikethrough format?
Example: C418, C419, C420 , C421, C422, C423, C424
Expected Result: C418 C419 C420 C421 C422 C423 C424
Final Result: C418, C419 C420 C421 C422 C423 C424
I check whether the cell contains strikethrough text, which I am able to detect with the function below. But once I remove the commas using the Replace function (replacing each comma with an empty string), the strikethrough format is removed as well, so the function no longer works and I get a different outcome.
I would like to use a space delimiter, to match the other cells, so that I can split the cell value afterwards.
If HasStrikethrough(BOMCk.Sheets("Filtered RO BOM").Range("B" & LCB)) = True Then
BOMCk.Sheets("Filtered RO BOM").Range("B" & LCB).Value = Replace(BOMCk.Sheets("Filtered RO BOM").Range("B" & LCB).Value, ",", "")
BOMCk.Sheets("Filtered RO BOM").Range("G" & LCB).Value = "strike-off"
ElseIf HasStrikethrough(BOMCk.Sheets("Filtered RO BOM").Range("B" & LCB)) = False Then
BOMCk.Sheets("Filtered RO BOM").Range("B" & LCB).Value = Replace(BOMCk.Sheets("Filtered RO BOM").Range("B" & LCB).Value, ",", "")
End If
Function HasStrikethrough(rng As Range) As Boolean
Dim i As Long
With rng(1)
For i = 1 To .Characters.Count
If .Characters(i, 1).Font.StrikeThrough Then
HasStrikethrough = True
Exit For
End If
Next i
End With
End Function
Editing text through Range.Characters only works reliably if the cell's value is 255 characters or fewer.
Range.Characters(i, 1).Delete will delete the commas. Make sure to iterate from the last position to the first position when deleting.
Sub RemoveCommas(ByVal Target As Range)
If Target.Characters.Count > 255 Then
MsgBox "Range.Characters only works with String with 255 or less Characters", vbCritical, "String too long"
Exit Sub
End If
Dim n As Long
For n = Target.Characters.Count To 1 Step -1
If Target.Characters(n, 1).Text = "," Then Target.Characters(n, 1).Delete
Next
End Sub
Alternative via xlRangeValueXMLSpreadsheet Value
The .Value(11) approach solves the question by a very simple string replacement (though the XML string handling can turn out to be very complicated in many other cases):
Sub RemoveCommata(rng As Range, Optional colOffset As Long = 1)
'a) Get range data as xml spreadsheet value
Dim xmls As String: xmls = rng.Value(xlRangeValueXMLSpreadsheet) ' //alternatively: xmls = rng.Value(11)
'b) find start position of body
Dim pos As Long: pos = InStr(xmls, "<Worksheet ")
'c) define xml spreadsheet parts and remove commata in body
Dim head As String: head = Left(xmls, pos - 1)
Dim body As String: body = Replace(Mid(xmls, pos), ",", "")
'd) write cleaned range back
rng.Offset(0, colOffset).Value(11) = head & body
End Sub
Help reference links
Excel XlRangeValueDataType enumeration
Excel Range Value
I am trying to replace some, but not all, of the spaces in a single string with line breaks. The string is taken from a specific cell, and looks like:
Now, I'm trying to replace each space after an abbreviation with a line break. The abbreviation can be anything, so is the best way of specifying which spaces I intend to replace something like: each space after number and before a letter?
The output I want to get is like:
Below is my code, but it changes every space in the cell to a line break.
Private Sub Workbook_SheetChange(ByVal Sh As Object, ByVal Target As Range)
    On Error GoTo Exitsub
    'qualify Columns with the changed sheet (a bare .Columns needs a With block)
    If Not Intersect(Target, Sh.Columns(6)) Is Nothing Then
        Application.EnableEvents = False
        Target.Value = Replace(Target, " ", Chr(10))
    End If
Exitsub:
    'always switch events back on, with or without an error
    Application.EnableEvents = True
End Sub
You can try
Target.Value = Replace(Target, "kg ", "kg" & Chr(10))
If you can have other abbreviations like "g" or "t", do something similar for them (maybe in a Sub), just be cautious with the order (replace "kg" first, then "g").
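A minimal sketch of that idea, assuming the abbreviations are known in advance. BreakAfterUnits and the unit list are hypothetical; the list follows the suggested order with the longer abbreviation first.

```vba
' Sketch only: BreakAfterUnits and its unit list are illustrative assumptions.
Function BreakAfterUnits(ByVal s As String) As String
    Dim units As Variant, u As Variant
    units = Array("kg", "g", "t")   ' hypothetical list, longer abbreviations first
    For Each u In units
        ' Replace "<unit><space>" with "<unit><line feed>"
        s = Replace(s, u & " ", u & vbLf)
    Next u
    BreakAfterUnits = s
End Function
```

In the event handler this could be called as Target.Value = BreakAfterUnits(Target.Value). Note that it also breaks after any other word that happens to end in one of these letters and is followed by a space.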
Update: If you don't know in advance the possible abbreviations, one attempt is to use regular expressions. I'm not really good with them, but the following routine seems to do:
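'Needs a reference to "Microsoft VBScript Regular Expressions 5.5"
' in your VBA Project references
Function replaceAbbr(s As String) As String
    Dim regex As New RegExp
    regex.Global = True
    'one or more lowercase letters followed by a space
    regex.Pattern = "([a-z]+) "
    replaceAbbr = regex.Replace(s, "$1" & Chr(10))
End Function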
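As a variation, here is a hedged sketch that only breaks after an abbreviation when a digit follows, and that creates the RegExp object late-bound so no project reference is needed. BreakAfterAbbr is a hypothetical name, and CreateObject("VBScript.RegExp") is only available on Windows.

```vba
' Sketch only: BreakAfterAbbr is a hypothetical variation on replaceAbbr.
Function BreakAfterAbbr(ByVal s As String) As String
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")   ' late binding, Windows only
    re.Global = True
    re.IgnoreCase = True
    ' letters, then one or more spaces, but only if a digit follows (lookahead)
    re.Pattern = "([a-z]+) +(?=\d)"
    BreakAfterAbbr = re.Replace(s, "$1" & vbLf)
End Function
```

The lookahead keeps the digit in place, so only the spaces between an abbreviation and the next number are turned into a line feed.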
The code below replaces every 2nd space with a line feed. For reasons unknown to me, the worksheet function Replace works as intended here, but the VBA Replace doesn't (the worksheet function replaces by character position, whereas VBA's Replace searches for a substring).
It loops through every character in the changed cell; you can change this to whatever you want.
The If statement breaks down as follows:
(SpaceCount Mod 2) = 0 is the part that picks out every 2nd space.
As a side note, (SpaceCount Mod 3) = 0 picks every 3rd space, and (SpaceCount Mod 2) = 1 picks the first space and then every other one.
Target.Characters(CountChr, 1).Text = " " makes sure we are replacing a space; if the user enters something funny that looks like a space but isn't, that's on them.
I believe something like this will work as intended for you:
Private Sub Workbook_SheetChange(ByVal Sh As Object, ByVal Target As Range)
On Error GoTo Exitsub
Application.EnableEvents = False
For CountChr = 1 To Len(Target.Value)
If Target.Characters(CountChr, 1).Text = " " Then
Dim SpaceCount As Integer
SpaceCount = SpaceCount + 1
If (SpaceCount Mod 2) = 0 Then
Target.Value = WorksheetFunction.Replace(Target.Value, CountChr, 1, Chr(10))
End If
End If
Next CountChr
Application.EnableEvents = True
Exitsub:
Application.EnableEvents = True
End Sub
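The same every-second-space logic can also be sketched on a plain string, without touching the Characters object or the cell inside the loop. BreakEverySecondSpace is a hypothetical name; it relies on the Mid$ statement to overwrite one character in place.

```vba
' Sketch only: BreakEverySecondSpace is a hypothetical, string-only variant.
Function BreakEverySecondSpace(ByVal s As String) As String
    Dim i As Long, spaceCount As Long
    For i = 1 To Len(s)
        If Mid$(s, i, 1) = " " Then
            spaceCount = spaceCount + 1
            ' Overwrite every 2nd space with a line feed (string length is unchanged)
            If spaceCount Mod 2 = 0 Then Mid$(s, i, 1) = vbLf
        End If
    Next i
    BreakEverySecondSpace = s
End Function
```

The handler would then need a single write, for example Target.Value = BreakEverySecondSpace(Target.Value).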
Identify arbitrary abbreviation first
"abbreviations aren't determined ..."
Knowing that the abbreviation varies between strings but is the same within each string (here e.g. kg) actually helps to follow the initial idea of looking at the blanks first: but instead of replacing them all with vbLf or Chr(10), this approach
a) splits the string at this " " delimiter into a zero-based tmp array and immediately identifies the arbitrary abbreviation abbr as second token, i.e. tmp(1)
b) executes a negative filtering to get the numeric data and eventually
c) joins them together using the abbreviation which is known now for the given string.
So you could change your assignment to
'...
Target.Value = repl(Target) ' << calling help function repl()
Possible help function
Function repl(ByVal s As String) As String
'a) split into tokens and identify arbitrary abbreviation
Dim tmp, abbr As String
tmp = Split(s, " "): abbr = tmp(1)
'b) filter out abbreviation
tmp = Filter(tmp, abbr, Include:=False)
'c) return result string
repl = Join(tmp, " " & abbr & vbLf) & " " & abbr ' keep the blank before the last abbreviation, too
End Function
Edit // responding to FunThomas' comment
ad a): If there might be missing spaces between number and abbreviation, the above approach could be modified as follows:
Function repl(ByVal s As String) As String
'a) split into tokens and identify arbitrary abbreviation
Dim tmp, abbr As String
tmp = Split(s, " "): abbr = tmp(1)
'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
'b) renew splitting via found abbreviation (plus blank)
'~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tmp = Split(s & " ", abbr & " ")
'c) return result string
repl = Join(tmp, abbr & vbLf): repl = Left(repl, Len(repl) - 1)
End Function
ad b): following OP citing e.g. "10 kg 20 kg 30,5kg 15kg 130,5 kg" (and as already remarked above) assumption is made that the abbreviation is the same for all values within one string, but can vary from item to item.
This site has been a veritable treasure chest of answers and ideas for many of my VBA problems in the past, but I have not been able to find anything concerning what I am sure is, for many if not most people in this forum, a simple task. I have to deal with a lot of XML report files that all have a header string, and my problem is how to parse the string for the nuggets I require for my macro.
This is a sample string:
<Function IDREF="TST_RxRccsMatrix_Rx64" Start="2011-04-07T14:21:35.593000+02:00" Status="Success" Tags="SystemSerialNumber:41009" End="2011-04-07T14:29:16.625000+02:00">
I need to extract
- the report type: TST_RxRccsMatrix (length of this string is not constant)
- the start date-time stamp: 2011-04-07T14:21:35.593000+02:00 (length is constant)
- the serial number: 41009 (length is constant)
I have tried methods using Split and InStr and Find but none produce the desired results for all three extractions.
I truly appreciate any help on this!
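The old-fashioned way is to use InStr to find the beginning, then use InStr again to find the ending, then use Mid to pull the value out.
'End is a reserved word in VBA, so the positions are called startPos/endPos
startPos = InStr(1, xmlstring, "IDREF=") + Len("IDREF=")
'look for first space after IDREF= in string
endPos = InStr(startPos, xmlstring, " ")
'note: the result still includes the surrounding quote characters
Report = Mid(xmlstring, startPos, endPos - startPos)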
I didn't test it.
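A hedged sketch of how the InStr/Mid idea might be generalised to all three values from the question. QuotedValue and DemoQuotedValue are hypothetical names, and the sketch assumes every value is wrapped in double quotes as in the sample header.

```vba
' Sketch only: QuotedValue is a hypothetical helper built on InStr and Mid.
Function QuotedValue(ByVal src As String, ByVal key As String) As String
    Dim p1 As Long, p2 As Long
    p1 = InStr(1, src, key & "=""")          ' find e.g. Start="
    If p1 = 0 Then Exit Function             ' key not found: return ""
    p1 = p1 + Len(key) + 2                   ' first character after the opening quote
    p2 = InStr(p1, src, """")                ' closing quote
    If p2 >= p1 Then QuotedValue = Mid$(src, p1, p2 - p1)
End Function

Sub DemoQuotedValue()
    Dim h As String, idref As String
    h = "<Function IDREF=""TST_RxRccsMatrix_Rx64"" " & _
        "Start=""2011-04-07T14:21:35.593000+02:00"" Status=""Success"" " & _
        "Tags=""SystemSerialNumber:41009"" " & _
        "End=""2011-04-07T14:29:16.625000+02:00"">"
    idref = QuotedValue(h, "IDREF")
    ' report type: everything before the last underscore
    Debug.Print Left$(idref, InStrRev(idref, "_") - 1)
    Debug.Print QuotedValue(h, "Start")
    ' serial number: the part after the colon in Tags
    Debug.Print Split(QuotedValue(h, "Tags"), ":")(1)
End Sub
```

Because the key is matched as a plain substring, a key such as End would also match inside a longer attribute name ending in End; that is acceptable for this header but worth checking for other files.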
But I'd split on space, then go through the array splitting on =. That will give you an array of 2-element arrays with the value name in (0) and the value in (1).
But XML has its own query language and libraries to access stuff.
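As a rough sketch of the library route, the header can be read with the MSXML DOM and its attributes queried by name. ReadHeaderWithDom is a hypothetical name; the sketch assumes Windows with MSXML 6 installed and closes the sample tag with "/>" so that it is well-formed on its own.

```vba
' Sketch only: ReadHeaderWithDom is hypothetical and needs MSXML 6 (Windows).
Sub ReadHeaderWithDom()
    Dim doc As Object, el As Object, hdr As String
    ' Header from the question, self-closed so it parses as a complete document
    hdr = "<Function IDREF=""TST_RxRccsMatrix_Rx64"" " & _
          "Start=""2011-04-07T14:21:35.593000+02:00"" Status=""Success"" " & _
          "Tags=""SystemSerialNumber:41009"" " & _
          "End=""2011-04-07T14:29:16.625000+02:00""/>"
    Set doc = CreateObject("MSXML2.DOMDocument.6.0")
    doc.async = False
    If Not doc.LoadXML(hdr) Then
        Debug.Print "Parse error: " & doc.parseError.reason
        Exit Sub
    End If
    Set el = doc.DocumentElement
    Debug.Print el.getAttribute("IDREF")
    Debug.Print el.getAttribute("Start")
    Debug.Print el.getAttribute("Tags")
End Sub
```

For a complete report file, doc.Load with the file path followed by an XPath query such as doc.SelectSingleNode("//Function") would presumably take the place of LoadXML, provided the whole file is well-formed XML.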
This is some code splitting a command line and then splitting 320x200 into 320 and 200.
CmdLine = Command()
A = Split(CmdLine, Chr(32), 2, 1)
B = Split(A(0), "x", 2, 1)
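Applied to the header string from the question, the split-on-space-then-on-= idea might look like the sketch below. AttrValue is a hypothetical helper, and it assumes that no attribute value contains a space.

```vba
' Sketch only: AttrValue is hypothetical and assumes values contain no spaces.
Function AttrValue(ByVal tag As String, ByVal attrName As String) As String
    Dim parts() As String, kv() As String, i As Long
    parts = Split(tag, " ")                      ' one token per attribute
    For i = LBound(parts) To UBound(parts)
        kv = Split(parts(i), "=", 2)             ' name in (0), value in (1)
        If UBound(kv) = 1 Then
            If kv(0) = attrName Then
                ' strip the quotes and a trailing ">" on the last attribute
                AttrValue = Replace(Replace(kv(1), """", ""), ">", "")
                Exit Function
            End If
        End If
    Next i
End Function
```

With the sample header, AttrValue(xmlstring, "Tags") returns SystemSerialNumber:41009, which can then be split on ":" to get the serial number.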
xmlstring = "<Function IDREF=""TST_RxRccsMatrix_Rx64"" Start=""2011-04-07T14:21:35.593000+02:00"" Status=""Success"" Tags=""SystemSerialNumber:41009"" End=""2011-04-07T14:29:16.625000+02:00"">"
Set regEx = New RegExp
regEx.Pattern = "IDREF=""([a-z0-9_]+)"""
regEx.IgnoreCase = True
regEx.Global = True
Set Matches = regEx.Execute(xmlstring)
If Matches.count <> 1 then msgbox "no match or too many"
For Each Match in Matches
Msgbox match.submatches(0)
Next
I answered your questions. The other person deleted two easier ways of doing it.
Ask Oded to put back my explanation of this code. And to restore the MS tutorial on how to do it with XML DOM objects. I showed FOUR ways.
After some polishing:
Private Sub GetFileInfo()
Dim fso As New FileSystemObject, strText As Variant, i As Integer
Dim X(0 To 2) As String, Y(0 To 2) As String, B, E As Variant
'get header string from xml file
'FName (file name) was ascertained by a previous sub and is made public
Set strText = fso.OpenTextFile(FName, ForReading, False)
'header string is in second (i = 2) line of file
For i = 1 To 2: [A1] = strText.ReadLine: Next: strText.Close: Set fso = Nothing
'User Oded's search and extract routine
X(0) = "IDREF=": X(1) = "Start=": X(2) = "Tags="
For i = LBound(X(), 1) To UBound(X(), 1)
B = InStr(1, [A1], X(i)) + Len(X(i)) + 1 ' + 1 skips the opening " character
E = InStr(B, [A1], " ") - 1 ' - 1 stops at the closing " character
'required if a search string in X() is at the end of the header, which ends with a ">"
If E < 0 Then E = InStr(B, [A1], ">") - 1
Y(i) = Mid([A1], B, E - B)
Next
[D1] = "Test = " & Y(0)
[D2] = "Tested on : " & Left(Y(1), 10) & " at " & Mid(Y(1), 12, 8)
[D2] = [D2] & " - " & Y(2)
End Sub
I'm trying to prepare a spreadsheet for a report in Excel VBA. Unfortunately there are some weird characters here that need to be replaced. Easy enough, except for this character:
¦
I can't seem to paste that character into a string replace function in the editor. When I try, the output is _. I then thought to refer to it by its Chr code. A quick lookup said it was Chr(166). http://www.gtwiki.org/mwiki/?title=VB_Chr_Values
Replace(s, "â€" + Chr(166), "...")
But this is not that character at all (at least in Excel for Mac). I tried:
For i = 1 To 255
Debug.Print Chr(i)
Next i
And I didn't see this character anywhere. Does anyone know how I can reference this character in vba code in order to replace it?
Not sure if regexp is available for VBA on Mac, but you could simplify your existing code greatly as below.
This uses a sample string, strIn:
Dim strIn As String
strIn = "1â€1â€x123"
Do While InStr(strIn, "â€") > 0
Mid$(strIn, InStr(strIn, "â€"), 3) = "..."
Loop
Click on a cell containing your miscreant character and run this small macro:
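Sub WhatIsIt()
    Dim s As String, message As String, ch As String
    Dim L As Long, i As Long, cd As Long
    s = ActiveCell.Text
    L = Len(s)
    For i = 1 To L
        ch = Mid(s, i, 1)
        'AscW gives the Unicode value (UTF-16 code unit); Asc would only give the ANSI code
        cd = AscW(ch)
        message = message & ch & " " & cd & vbCrLf
    Next i
    MsgBox message
End Sub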
It should reveal the characters in the cell and their codes.
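Once the codes are known, the characters can be built with ChrW instead of being pasted into the editor. The sketch below assumes the three characters are U+00E2, U+20AC and U+00A6 (226, 8364 and 166 as reported by AscW), which is what the UTF-8 bytes of an ellipsis look like when misread as Windows-1252; that is an assumption to confirm against the actual cell. FixEllipsisMojibake is a hypothetical name, and ChrW may behave differently on older versions of Excel for Mac.

```vba
' Sketch only: the three code points are an assumption, verify them on your data.
Function FixEllipsisMojibake(ByVal s As String) As String
    Dim bad As String
    ' U+00E2, U+20AC, U+00A6: a UTF-8 ellipsis misread as Windows-1252
    bad = ChrW(&HE2) & ChrW(&H20AC) & ChrW(&HA6)
    FixEllipsisMojibake = Replace(s, bad, "...")
End Function
```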
It's dirty, but here's the workaround that I used to solve this problem. I knew that my issue character always came after "â€", so the idea was to replace the character that came after those 2. I don't really know how to replace a character at a position in a string, so my idea was to convert the string to an array of characters and replace the array at those specific indexes. Here's what it looks like:
Do While InStr(s, "â€") > 1
num2 = InStr(s, "â€")
arr = stringToArray(s)
arr(num2 - 1) = "<~>"
arr(num2) = "<~>"
arr(num2 + 1) = "<~>"
s = Replace(arrayToString(arr), "<~><~><~>", "...")
Loop
...
Function stringToArray(ByVal my_string As String) As Variant
Dim buff() As String
ReDim buff(Len(my_string) - 1)
For i = 1 To Len(my_string)
buff(i - 1) = Mid$(my_string, i, 1)
Next
stringToArray = buff
End Function
Function arrayToString(ByVal arr As Variant) As String
Dim s As String
For Each j In arr
s = s & j
Next j
arrayToString = s
End Function
In practice, what I replaced those indexes with is something that had to be unique but recognizable. Then I can replace my unique characters with whatever I want. There are sure to be edge cases, but for now it gets the job done. The stringToArray function was pulled from: Split string into array of characters?