Convert String Characters to Text - excel

I'm trying to convert a large number of format strings into written descriptions. For example, YYYY####### should become "4 digit year, 7 numeric digits" and YYMMDD-#### should become "2 digit year, 2 digit month, 2 digit day, hyphen, 4 numeric digits".
The constant characters are Y, M, D, #, - and X (X stands for undefined alpha characters; Y, M, D and X are never used for anything other than Year, Month, Day and Alpha respectively). Some strings also contain defined alpha characters, e.g. RP-YYYY#####. I want to capture those (anything other than the constant characters) and state them as they are, so the written text for RP-YYYY##### would be "RP, hyphen, 4 digit year, 5 numeric digits".
I'm able to get a count of each character using Len and Replace, but I'm struggling to produce the written text in the correct order, or to capture non-constant characters like RP and state them as is.
Any help would be much appreciated!
Sub getcharacters()
Dim casenumber As String
casenumber = Range("A1")
InitialCount = Len(casenumber)
YearDigits = Len(casenumber) - Len(Replace(casenumber, "Y", ""))
MonthDigits = Len(casenumber) - Len(Replace(casenumber, "MM", ""))
DayDigits = Len(casenumber) - Len(Replace(casenumber, "DD", ""))
NumberDigits = Len(casenumber) - Len(Replace(casenumber, "#", ""))
AlphaDigits = Len(casenumber) - Len(Replace(casenumber, "X", ""))
HyphenDigits = Len(casenumber) - Len(Replace(casenumber, "-", ""))
FinalCount = InitialCount - YearDigits - MonthDigits - DayDigits - Digits - AlphaDigits
If YearDigits = "0" Then WrittenYear = ""
If YearDigits = "2" Then WrittenYear = "Two digit year"
If YearDigits = "4" Then WrittenYear = "Four digit year"
If MonthDigits = "0" Then WrittenMonth = "" Else WrittenMonth = "Two digit month"
If DayDigits = "0" Then WrittenDay = "" Else WrittenDay = "Two digit day"
If NumberDigits = "0" Then WrittenDigits = "" Else WrittenDigits = NumberDigits & " digits"
If AlphaDigits = "0" Then WrittenAlpha = "" Else WrittenAlpha = AlphaDigits & " alpha characters"
WrittenCaseNumber = WrittenYear & WrittenMonth & WrittenDay & WrittenDigits & WrittenAlpha
End Sub

You may use Regex for the matching and replacing part along with some helper functions to convert digits to words, etc. I admit it's a bit ugly but it does the trick.
First, add the following two references to your project:
Microsoft VBScript Regular Expressions 5.5
Microsoft Scripting Runtime
Second, add the following code to a Module:
Option Explicit
Private DictAlphaCharacters As Scripting.Dictionary
Private Sub InitializeDictAlphaCharacters()
Set DictAlphaCharacters = New Scripting.Dictionary
DictAlphaCharacters.Add "Y", "digit year"
DictAlphaCharacters.Add "M", "digit month"
DictAlphaCharacters.Add "D", "digit day"
DictAlphaCharacters.Add "#", "numeric digits"
End Sub
Public Function DescribeThis(s As String) As String
If DictAlphaCharacters Is Nothing Then InitializeDictAlphaCharacters
Dim tmpStr As String: tmpStr = s
Dim regEx As New RegExp
regEx.Global = True
Dim matches As MatchCollection
Dim m As Match
Dim k As Variant ' Dictionary key.
Dim alpha As String ' The corresponding sentence for an alpha char.
Dim l As Integer ' Length of the matched string (consecutive alpha chars).
Dim w As String ' The corresponding word of a digit.
For Each k In DictAlphaCharacters.Keys
alpha = DictAlphaCharacters.Item(k)
regEx.Pattern = k & "{1,9}"
Set matches = regEx.Execute(tmpStr)
For Each m In matches
l = m.Length
w = DigitToWord(l)
' Pattern ex. = "([^Y])?,?Y{2}(?!Y)"
regEx.Pattern = "([^" & k & "])?,?" & k & "{" & l & "}(?!" & k & ")"
' Replacement example: "$1,Two digit year,"
tmpStr = regEx.Replace(tmpStr, "$1," & w & " " & alpha & ",")
Next
Next
regEx.Pattern = ",?-,?"
tmpStr = regEx.Replace(tmpStr, ",hyphen,")
regEx.Pattern = "^,+|,+$"
DescribeThis = regEx.Replace(tmpStr, "")
End Function
Public Function DigitToWord(d As Integer) As String
Select Case d
Case 1: DigitToWord = "One"
Case 2: DigitToWord = "Two"
Case 3: DigitToWord = "Three"
Case 4: DigitToWord = "Four"
Case 5: DigitToWord = "Five"
Case 6: DigitToWord = "Six"
Case 7: DigitToWord = "Seven"
Case 8: DigitToWord = "Eight"
Case 9: DigitToWord = "Nine"
End Select
End Function
Usage:
Sub Test()
Debug.Print DescribeThis("YYYY#######")
Debug.Print DescribeThis("YYMMDD-####")
Debug.Print DescribeThis("RP-YYYY#####")
Debug.Print DescribeThis("YYYMMM-YYMM")
End Sub
Output:
Four digit year,Seven numeric digits
Two digit year,Two digit month,Two digit day,hyphen,Four numeric digits
RP,hyphen,Four digit year,Five numeric digits
Three digit year,Three digit month,hyphen,Two digit year,Two digit month

This seems to accomplish what you want.
As written it assumes that all of the "like" characters in the set of [YMD#] are contiguous. If groups of Y's, for example, could repeat in different parts of the string, we just need to change the charCnt function.
Option Explicit
'set reference to Microsoft Scripting Runtime
Function convStr(S As String) As String
Dim myDict As Dictionary
Dim sRes() As String
Dim I As Long
Dim CH As String
Set myDict = New Dictionary
myDict.CompareMode = TextCompare
myDict.Add "Y", "digit year"
myDict.Add "M", "digit month"
myDict.Add "D", "digit day"
myDict.Add "#", "numeric digits"
myDict.Add "-", "hyphen"
ReDim sRes(0)
For I = 1 To Len(S)
CH = Mid(S, I, 1)
If myDict.Exists(CH) Then
sRes(UBound(sRes)) = IIf(CH <> "-", charCnt(S, CH) & " ", "") & myDict(CH)
I = I + charCnt(S, CH)
Else
Do While I <= Len(S) And Not myDict.Exists(CH) 'stop at end of string; Mid returns "" there, which would otherwise loop forever
sRes(UBound(sRes)) = sRes(UBound(sRes)) & CH
I = I + 1
CH = Mid(S, I, 1)
Loop
End If
I = I - 1
ReDim Preserve sRes(UBound(sRes) + 1)
Next I
ReDim Preserve sRes(UBound(sRes) - 1)
convStr = Join(sRes, ", ")
End Function
Function charCnt(S As String, CH As String) As Long
Dim startChar As Long
startChar = InStr(S, CH)
If startChar > 0 Then
charCnt = Len(S) - Len(Replace(S, CH, ""))
Else
charCnt = 0
End If
End Function
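If groups of the same character can repeat in different parts of the string (for example YYMM-YY), one way to change the counting is to measure only the contiguous run that starts at the current position. The following is a minimal sketch of that idea, not the answerer's code: runLen is a hypothetical name, and it assumes the same case-insensitive comparison as the dictionary above.

```vba
Option Explicit

' Hypothetical helper: length of the run of identical characters
' that begins at startPos (returns 0 if startPos is past the end).
Function runLen(S As String, ByVal startPos As Long) As Long
    Dim CH As String
    Dim I As Long
    CH = Mid$(S, startPos, 1)
    I = startPos
    Do While I <= Len(S)
        ' Leave the loop at the first character that differs from CH
        If StrComp(Mid$(S, I, 1), CH, vbTextCompare) <> 0 Then Exit Do
        I = I + 1
    Loop
    runLen = I - startPos
End Function
```

In convStr, both calls to charCnt(S, CH) would then become runLen(S, I), so each group is described by its own length rather than by the total count in the whole string.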

Related

Extract only 8-digit numbers that are not followed by the characters (. , #)

I am using a function to extract numbers from a string, with the conditions that the number has 8 digits and is not followed by any of the characters (. , #).
It works for 8-digit numbers, but if a number is followed by one of the characters (. , #), it extracts that number as well, which is not wanted.
My string is 11111111 12345678.1 11111112 11111113 and the expected output is 11111111 11111112 11111113, without 12345678.1.
I tried a negative lookahead, \d{8}(?!.,#), but it does not work.
Thanks all for your help.
Function Find8Numbers(st As String) As Variant
Dim regex As New RegExp
Dim matches As MatchCollection, mch As match
regex.Pattern = "\d{8}" 'Look for variable length numbers only
regex.IgnoreCase = True
regex.Global = True
regex.MultiLine = True
If (regex.Test(st) = True) Then
Set matches = regex.Execute(st) 'Execute search
For Each mch In matches
Find8Numbers = LTrim(Find8Numbers) & " " & mch.value
Next
End If
End Function
In line with your question and current attempt, you could indeed use regex:
Function Find8Numbers(st As String) As String
With CreateObject("vbscript.regexp")
.Pattern = "(?:^|\s)(\d{8})(?![.,#\d])"
.Global = True
If .Test(st) Then
Set Matches = .Execute(st)
For Each mch In Matches
Find8Numbers = LTrim(Find8Numbers & " " & mch.submatches(0))
Next
End If
End With
End Function
Invoke through:
Sub Test()
Dim s As String: s = "11111111 12345678.1 11111112 11111113"
Debug.Print Find8Numbers(s)
End Sub
Prints:
11111111 11111112 11111113
Pattern used:
(?:^|\s)(\d{8})(?![.,#\d])
(?:^|\s) - The VBA regex engine has no lookbehind, so a non-capturing group matches either the start-of-line anchor or a whitespace character;
(\d{8}) - Exactly 8 digits in capture group;
(?![.,#\d]) - Negative lookahead to assert position isn't followed by any of given characters including digits.
I'm not sure you need Regex for what is a reasonably simple pattern. You could just go with a VBA solution:
Public Function Find8Numbers(str As String) As String
Dim c As String, c1 As String
Dim i As Long, numStart As Long
Dim isNumSeq As Boolean
Dim result As String
If Len(str) < 8 Then Exit Function
For i = 1 To Len(str)
c = Mid(str, i, 1)
If i = Len(str) Then
c1 = ""
Else
c1 = Mid(str, i + 1, 1)
End If
If c >= "0" And c <= "9" Then
If isNumSeq Then
If i - numStart + 1 = 8 Then
If c1 <> "." And c1 <> "," And c1 <> "#" Then
If result <> "" Then result = result & " "
result = result & Mid(str, numStart, 8)
isNumSeq = False
End If
End If
Else
If i > Len(str) - 8 + 1 Then Exit For
isNumSeq = True
numStart = i
End If
Else
isNumSeq = False
End If
Next
Find8Numbers = result
End Function

Converting a string with date in US format to UK format

I am trying to convert a string, which has a date in US format into UK format.
The following code seems to be hit or miss when it comes to a date that is single digits for both the day and the month:
X = 3
Do While strTimeStamp = 0
If InStr(WS2.Cells(lRow, lCol), "TIMESTAMP") <> 0 Then
strHPCStats = Split(WS2.Cells(lRow, lCol), " ")
'strHPCStats(X) = Mid(strHPCStats(X), 4, 6)
re.Pattern = "^(\d{2})(\d{2})(\d{4})$"
strHPCStats(X) = re.Replace(strHPCStats(X), "$3/$2/$1")
strHPCStats(X) = Format$(strHPCStats(X), "dd/mmm/yyyy")
strTimeStamp = strHPCStats(X)
WS2.Cells(lRow, lCol).EntireRow.Delete
lRow = lRow - 1
Else
WS2.Cells(lRow, lCol).EntireRow.Delete
lRow = lRow - 1
End If
lRow = lRow + 1
Loop
The typical string:
4:19:17 (application) TIMESTAMP 3/13/2022
The string where it is having trouble:
5:36:32 (cameo) TIMESTAMP 4/1/2022
\d{2} matches exactly 2 digits, so if your date has a month (or day) with only 1 digit, the regex doesn't match.
If you want to allow 1 or 2 digits, you can use \d{1,2}. Since the dates in your strings separate the parts with slashes, the pattern has to match those as well, so the statement would be
re.Pattern = "^(\d{1,2})/(\d{1,2})/(\d{4})$"
Details: How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
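As a rough illustration of a 1-or-2-digit pattern applied to the sample dates, the sketch below swaps the month and day groups of a US-style m/d/yyyy string. It is only a sketch under assumptions: UsToUkDateString is a hypothetical name, the pattern includes the slashes that appear in the sample timestamps, and the RegExp object is late-bound so no reference is required.

```vba
Option Explicit

' Hypothetical helper: turns "m/d/yyyy" into "d/m/yyyy".
' Input that does not match the pattern is returned unchanged.
Function UsToUkDateString(ByVal s As String) As String
    With CreateObject("VBScript.RegExp")
        ' Group 1 = month, group 2 = day, group 3 = year
        .Pattern = "^(\d{1,2})/(\d{1,2})/(\d{4})$"
        UsToUkDateString = .Replace(s, "$2/$1/$3")
    End With
End Function
```

The result is still a string. If a month name such as dd/mmm/yyyy is needed, building a real Date with DateSerial from the three groups avoids any dependence on regional settings.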
There's no need to use regular expressions, given your expected data.
Just look for two slashes in a space-separated string:
Function us2ukDate(S As String) As Date
Dim v, w, x
v = Split(S, " ")
For Each w In v
If (Len(w) - Len(Replace(w, "/", ""))) = 2 Then
x = Split(w, "/")
us2ukDate = DateSerial(x(2), x(0), x(1))
Exit Function
End If
Next w
End Function
If, instead of just returning the date, you want to change the format within the string, you could do something like:
Sub convertStrings()
Const d1 = "4:19:17 (application) TIMESTAMP 3/13/2022"
Const d2 = "5:36:32 (cameo) TIMESTAMP 4/1/2022"
Dim sParts
sParts = Split(d1, " ")
sParts(UBound(sParts)) = Format(us2ukDate(sParts(UBound(sParts))), "dd-mmm-yyyy")
Debug.Print Join(sParts, " ")
sParts = Split(d2, " ")
sParts(UBound(sParts)) = Format(us2ukDate(sParts(UBound(sParts))), "dd-mmm-yyyy")
Debug.Print Join(sParts, " ")
End Sub

Generating regular expression in Excel for strings

I have a huge list of strings for which I am trying to generate regular expressions in an automated way. The strings are pretty simple and I would like to generate the regular expressions using a formula or VBA code. The strings use the following legend:
& - Any UPPERCASE character (A-Z)
# - Any digits (0-9)
_ - Space (\s)
- - Dash
For example, the regular expression generated for the following strings:
Policy Number                          Policy Digits  Regular Expression
####&&######                           12             ^\d{4}[A-Z]{2}\d{6}$
####&_######                           11             ^\d{4}[A-Z]{1}\s\d{6}$
ACPBP&&##########                      17             ^[ACPBP]{5}[A-Z]{2}\d{10}$
ACPBA&########## or ACPBA&&##########  16 or 17       ^[ACPBA]{5}[A-Z]{1,2}\d{10}$
##########                             10             ^\d{10}$
09############                         14             ^[09]{2}\d{12}$
A&&######, A&&#######, or A&&########  9, 10 or 11    ^[A]{1}[A-Z]{2}\d{6,8}$
&&&####, &&&#####, or &&&######        7, 8, or 9     ^[A-Z]{3}\d{4,6}$
09-##########-##                       14             ^[09]{2}-\d{10}-\d{2}$
Is there some existing code that is available to generate regular expressions for a huge list of strings? What are some of the hints or tips that I can use to build a regular expression string? Thanks in advance.
There is no existing code, but try this:
Option Explicit
Option Compare Text 'to handle upper and lower case "or"
'Set reference to Microsoft Scripting Runtime
' or use Late Binding if distributing this
Function createRePattern(sPolicyNum As String) As String
Dim dCode As Dictionary, dReg As Dictionary
Dim I As Long, sReg As String, s As String
Dim v, sPN
v = Replace(sPolicyNum, "or", ",")
v = Split(v, ",")
Set dCode = New Dictionary
dCode.Add Key:="&", Item:="[A-Z]"
dCode.Add Key:="#", Item:="\d"
dCode.Add Key:="_", Item:="\s"
For Each sPN In v
sPN = Trim(sPN)
If Not sPN = "" Then
Set dReg = New Dictionary
For I = 1 To Len(sPN)
s = Mid(sPN, I, 1)
If Not dCode.Exists(s) Then dCode.Add s, s
If dReg.Exists(s) Then
dReg(s) = dReg(s) + 1
Else
If dReg.Count = 1 Then
dReg.Add s, 1
s = Mid(sPN, I - 1, 1)
sReg = sReg & dCode(s) & IIf(dReg(s) > 1, "{" & dReg(s) & "}", "")
dReg.Remove s
Else
dReg.Add s, 1
End If
End If
Next I
'Last Entry in Regex
s = Right(sPN, 1)
sReg = sReg & dCode(s) & IIf(dReg(s) > 1, "{" & dReg(s) & "}", "") & "|"
End If
Next sPN
s = Left(sReg, Len(sReg) - 1)
'Non-capturing group added if alternation present
If InStr(s, "|") = 0 Then
sReg = "^" & s & "$"
Else
sReg = "^(?:" & Left(sReg, Len(sReg) - 1) & ")$"
End If
createRePattern = sReg
End Function
Note
As written, there is a limitation: the policy number cannot contain any of the following as literal text, because the code treats them as symbols or separators:
#, &, _, a comma, or the word "or"
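A generated pattern can be checked against sample policy numbers before it is relied on. The following is a minimal sketch of such a check, assuming late binding to VBScript.RegExp; MatchesPattern and its parameter names are hypothetical and not part of the code above.

```vba
Option Explicit

' Hypothetical helper: True if candidate matches the regular expression rePattern.
Function MatchesPattern(ByVal candidate As String, ByVal rePattern As String) As Boolean
    With CreateObject("VBScript.RegExp")
        .Pattern = rePattern
        .IgnoreCase = False   ' [A-Z] should accept upper case only
        MatchesPattern = .Test(candidate)
    End With
End Function
```

For example, MatchesPattern("1234AB567890", "^\d{4}[A-Z]{2}\d{6}$") should return True, while the same call with "1234ab567890" should return False.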
Generate regex patterns without dictionary
In addition to Ron's valid solution, here is an alternative that uses no dictionary:
Option Explicit ' declaration head of code module
Function generateRePattern(ByVal s As String) As String
'[0]definitions & declarations
Const Pipe As String = "|"
Dim curSymbol$: curSymbol = "" ' current symbol (start value)
Dim lngth As Long: lngth = Len(s) ' current string length
Dim ii As Long: ii = 0 ' group index (start value)
Dim n As Long ' repetition counter
ReDim tmp(1 To lngth) ' provide for sufficient temp items
'[1](optional) Pipe replacement for "or" and commas
s = Replace(Replace(Replace(s, " or ", Pipe), " ", ""), ",", Pipe)
'[2]analyze string item s
Dim pos As Long ' current character position
For pos = 1 To lngth ' check each character
Dim curChar As String
curChar = Mid(s, pos, 1) ' define current character
If curChar <> curSymbol Then ' start new group
'a) change repetition counter in old group pattern
If ii > 0 Then tmp(ii) = Replace(tmp(ii), "n", n)
'b) increment group counter & get pattern via help function
ii = ii + 1: tmp(ii) = getPattern(curChar) ' << getPattern
'c) start new repetition counter & group symbol
n = 1: curSymbol = curChar
Else
n = n + 1 ' increment current repetition counter
End If
Next pos
'd) change last repetition counter
tmp(ii) = Replace(tmp(ii), "n", n)
ReDim Preserve tmp(1 To ii) '
'[3]return function result
generateRePattern = "^(?:" & Replace(Join(tmp, ""), "{1}", "") & ")$"
End Function
Help function getPattern()
Function getPattern(curChar) As String
'Purpose: return general pattern based on current character
'a) definitions
Const Pipe As String = "|"
Dim symbols: symbols = Split("&|#|_", Pipe)
Dim patterns: patterns = Split("[A-Z]{n}|\d{n}|\s", Pipe)
'b) match character position within symbols
Dim pos: pos = Application.Match(curChar, symbols, 0)
'c) return pattern
If IsError(pos) Then
getPattern = curChar
Else
getPattern = patterns(pos - 1)
End If
End Function

Matching substrings and counting their occurrences to produce a brief sentence

I've the following Excel data:
A B C
+ ------------ ------------- -----------------
1 | WORD WORD MIX MATCH TEXT RESULT
2 | somewordsome emsomordsowe ...
3 | anotherword somethingelse ...
4 | ... ... ...
I'd like to:
Firstly, get an array, say ArrayOfGroups, by splitting the string in the A2 cell in unique groups of 2 to 12 adjacent chars (note: 2 is the minimum number of chars to form a group; 12 is the total number of the word's chars) i.e. the groups of 2 chars would be so, om, me, ew, wo, or, rd, ds (note: the last so, om and me groups are excluded because they are repeated); the groups of 3 chars would be som, ome, mew, ewo, wor, ord, rds, dso (last som and ome excluded); the groups of 4 chars would be some, omew, mewo, ewor, word, ords, rdso, dsom; ... and so on until the full string somewordsome.
Then, iterate over ArrayOfGroups to check whether each of its elements is a substring of the B2 cell, and return a new array, say ArrayOfMatches, containing all the elements (the character "group names") that are substrings of B2 together with the number of occurrences found in B2.
Finally, output in the C2 cell a sentence built using the ArrayOfMatches data that says something like this:
2 matches for so, 1 match for som and rd
There are probably other, better approaches to computing the above sentence, which is the final result I want. Maybe I need a User Defined Function, but I have never written one.
Can someone help?
You could try something like this.
Code edited to avoid counting the same substring more than once.
Sub test2()
Dim Xstr As String, Ystr As String
Xstr = "somewordsome"
Ystr = "emsomordsowe"
MsgBox Xmatch2(Xstr, Ystr)
End Sub
Function Xmatch2(Xstr As String, Ystr As String) As String
Dim XSubStr As String, YSubStr As String
Dim xLn As Integer, yLn As Integer
Dim XArr As Variant, LnSubStr As Integer
Dim Rslt As String, Cnt As Integer
Dim Xrr() As Variant, Xcnt As Integer, Chk As Boolean
Rslt = "'"
xLn = Len(Xstr)
yLn = Len(Ystr)
For LnSubStr = 2 To xLn 'length of substring
Xcnt = 0
ReDim XArr(1 To 1)
For Y = 1 To xLn
XSubStr = ""
Xcnt = Xcnt + 1
ReDim Preserve XArr(1 To Xcnt)
If Y + LnSubStr - 1 <= xLn Then XSubStr = Mid(Xstr, Y, LnSubStr)
XArr(Xcnt) = XSubStr
Chk = False
For i = 1 To Xcnt - 1
If XArr(i) = XSubStr Then
Chk = True
Exit For
End If
Next
If XSubStr <> "" And Chk = False Then
Cnt = 0
ReDim Preserve XArr(1 To Xcnt)
For Z = 1 To yLn
YSubStr = ""
If Z + LnSubStr - 1 <= yLn Then YSubStr = Mid(Ystr, Z, LnSubStr)
If YSubStr = XSubStr Then Cnt = Cnt + 1
Next
If Cnt > 0 Then Rslt = Rslt & Cnt & " Matches for " & XSubStr & ","
End If
Next
Next
Debug.Print Rslt
Xmatch2 = Rslt
End Function
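The inner loop over Z above counts how often one substring occurs in the second string. The same count can be obtained with repeated InStr calls; the sketch below is one possible way to do that, with CountOccurrences as a hypothetical helper name. It counts overlapping occurrences and compares case-sensitively.

```vba
Option Explicit

' Hypothetical helper: number of (possibly overlapping) occurrences
' of needle in haystack, using a case-sensitive comparison.
Function CountOccurrences(ByVal haystack As String, ByVal needle As String) As Long
    Dim pos As Long
    If Len(needle) = 0 Then Exit Function   ' nothing to search for, returns 0
    pos = InStr(1, haystack, needle, vbBinaryCompare)
    Do While pos > 0
        CountOccurrences = CountOccurrences + 1
        ' Continue one character after the previous hit so overlaps are found
        pos = InStr(pos + 1, haystack, needle, vbBinaryCompare)
    Loop
End Function
```

With the sample data, CountOccurrences("emsomordsowe", "so") gives 2, and both "som" and "rd" give 1, which agrees with the sentence asked for in the question.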

Excel reverse REPLACE

I am working on some software that cleans up data before sending it into another system. The data comes from all around the world and contains a variety of characters that have to be replaced. For example ‘, : ; #
The system that accepts the parsed data has very strict character set. It allows
the letters A to Z (upper case only)
the numerals 0 to 9
the special characters / - . space < =
The data arrives in Excel spreadsheets so I have written the following code in a visual basic macro.
fhl_str contains the data to be cleansed
fhl_str = Replace(fhl_str, ",", " ")
fhl_str = Replace(fhl_str, "'", " ")
fhl_str = Replace(fhl_str, ":", " ")
fhl_str = Replace(fhl_str, ";", " ")
fhl_str = ucase(fhl_str)
Now, each time a new unwanted character arrives, we have to add a new line of code, e.g. fhl_str = Replace(fhl_str, "#", " ")
My question is:
Could I reverse the logic so that the macro looks for A to Z and 0 to 9 and deletes anything else? That way my code would be future-proof against new unwanted characters.
Thanks
If you want to replace bad characters with a single space:
Sub KeepOnlyTheGood()
Dim i As Long, L As Long, v As String, CH As String
Dim t As String
Dim r As Range
For Each r In Selection
t = ""
v = UCase(r.Value) ' upper-case first so lowercase letters are kept rather than blanked
L = Len(v)
For i = 1 To L
CH = Mid(v, i, 1)
If CH Like "[0-9A-Z]" Or CH = "/" Or CH = "-" Or CH = "." Or CH = " " Or CH = "<" Or CH = "=" Then
t = t & CH
Else
t = t & " "
End If
Next i
r.Value = t
Next r
End Sub
Here's some VBA that will do it if you find regex difficult to understand. It uses the ASCII code to determine the only characters to allow. If your scope changes you can modify the ASCII numbers in the Case statement.
Public Function RemoveSpecial(s As String) As String
Dim sResult As String
Dim nIndex As Integer
s = UCase$(s)
For nIndex = 1 To Len(s)
Select Case Asc(Mid$(s, nIndex, 1))
Case 65 To 90, 45 To 57, 32, 60 To 61
sResult = sResult & Mid$(s, nIndex, 1)
Case Else
sResult = sResult & " "
End Select
Next
RemoveSpecial = sResult
End Function
Usage:
Debug.Print RemoveSpecial("TeSt<>=.##")
or something like:
Range("A1") = RemoveSpecial("TeSt<>=.##")
ASCII Codes
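For comparison, the same whitelist can also be written as one regular expression with a negated character class. This is only a sketch under assumptions: KeepAllowed is a hypothetical name, the allowed set is taken from the list in the question (A to Z, 0 to 9, / - . space < =), and the RegExp object is late-bound.

```vba
Option Explicit

' Hypothetical helper: upper-cases s, then replaces every character
' outside the allowed set with a single space.
Function KeepAllowed(ByVal s As String) As String
    With CreateObject("VBScript.RegExp")
        .Global = True                    ' replace every match, not just the first
        .Pattern = "[^A-Z0-9/\-. <=]"     ' anything NOT in the allowed set
        KeepAllowed = .Replace(UCase$(s), " ")
    End With
End Function
```

Adding another permitted character later only means extending the character class, which is the "reverse" logic asked about in the question.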
