Truncating Text To Full Words Based On Character Limit - Excel

I'm working with some data (DataSet#1) which has a text field truncated using some unconventional logic:
If "Service Type Description" is > 60 characters, trim the name down to < 60 characters, keeping only full words.
My problem is that I need to format some other data (DataSet#2) in Excel to match this logic, which is applied on the back end of our reporting server (outside my control). No one seems to be able to find a list of all the potential truncated descriptions either.
Dataset#1 is live and can be re-pulled with updated data at any time, so I need to create a template that allows me to pull in information from the list in DataSet#2 (which currently has the full length descriptions) into any copy of Dataset#1 based on the trimmed Service Type Description in DataSet#1.
Example:
The following is the full product name, and the product name in my DataSet#2:
"FNMA 1025 Small Residential Income Property Appraisal & FNMA 216 Addendum" (73 characters, including spaces)
Simply trimming this text to be <60 characters (59) would yield:
"FNMA 1025 Small Residential Income Property Appraisal & FNM"
However, this same product, in the main data (DataSet#1) is named as follows:
"FNMA 1025 Small Residential Income Property Appraisal & " (56 characters, 8 "words", including &)
The logic on the back-end for DataSet#1 has trimmed the full product name to under 60 characters, but retains only full words (removes the "FNM" partial word).
Ideally, I need to be able to take a list that has the full description names and apply logic in Excel (or VBA) that yields the same result as the trimmed data from the other dataset. That would let me pull information from DataSet#2 (full product names) into DataSet#1 based on the Service Type Description.

You could use something like so
Function truncate_string(strInput As String, Optional lngChars As Long = 60) As String
    Dim lngCharInstance As Long
    lngCharInstance = Len(strInput)
    ' Step back one space at a time until the cut point fits within lngChars
    While lngCharInstance > lngChars
        lngCharInstance = InStrRev(strInput, " ", _
            IIf(lngCharInstance >= Len(strInput), _
            Len(strInput), lngCharInstance - 1))
    Wend
    truncate_string = Mid(strInput, 1, lngCharInstance)
End Function
This would be called like so
truncate_string("FNMA 1025 Small Residential Income Property Appraisal & FNMA 216 Addendum")
and would return as follows
FNMA 1025 Small Residential Income Property Appraisal &
or like so, for 30 characters for example
truncate_string("FNMA 1025 Small Residential Income Property Appraisal & FNMA 216 Addendum",30)
which gives
FNMA 1025 Small Residential
Hope this helps. Since there is a loop in there, I'd check for any potential for infinite loops.
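For what it's worth, the loop can probably be avoided altogether. Here is a minimal sketch, assuming the same cut rule as truncate_string (cut at the last space at or before the limit). TruncateToWords and its parameter names are made up for illustration:

```vba
Option Explicit

' Hypothetical loop-free variant of truncate_string (illustrative names).
Function TruncateToWords(ByVal fullText As String, _
                         Optional ByVal maxChars As Long = 60) As String
    If Len(fullText) <= maxChars Then
        ' Short enough already: return unchanged
        TruncateToWords = fullText
    ElseIf maxChars < 1 Then
        ' Nothing can fit (also avoids calling InStrRev with a start of 0)
        TruncateToWords = ""
    Else
        ' InStrRev finds the last space at or before maxChars.
        ' It returns 0 when the first word alone is too long, giving "".
        TruncateToWords = Left$(fullText, InStrRev(fullText, " ", maxChars))
    End If
End Function
```

Like truncate_string, this keeps the trailing space, which matches the 56-character DataSet#1 value in the question. Wrap the call in RTrim if that is not wanted.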

You can use Regular Expressions for this.
Option Explicit
Function trimLength(S As String, Optional Length As Long = 60) As String
    Dim RE As Object, MC As Object
    Dim sPat As String
    ' Up to Length - 1 characters, ending just before whitespace or end of line
    sPat = "^.{1," & Length - 1 & "}(?=\s|$)"
    If Len(S) > Length Then
        Set RE = CreateObject("vbscript.regexp")
        With RE
            .Pattern = sPat
            .MultiLine = True
            Set MC = .Execute(S)
            ' No match means the first word alone exceeds the limit
            If MC.Count > 0 Then trimLength = MC(0)
        End With
    Else
        trimLength = S
    End If
End Function
Note that, in accord with your question, we subtract one from the desired length.
Explanation of the Regex
Trim Length to Whole Word
^.{1,59}(?=\s|$)
Options: ^$ match at line breaks
Assert position at the beginning of a line: ^
Match any single character that is NOT a line break character: .{1,59}
    Between one and 59 times, as many times as possible, giving back as needed (greedy): {1,59}
Assert that the regex below can be matched starting at this position (positive lookahead): (?=\s|$)
    Match this alternative: \s
        Match a single character that is a "whitespace character": \s
    Or match this alternative: $
        Assert position at the end of a line: $
Created with RegexBuddy

Related

How to match accented characters but not tab

I'm trying to match the company name in this string delimited with tabs.
Below table does not have tabs when you copy it, but I have replaced tabs with two spaces, which I assume will work fine for testing.
1025164 HERBEX IBERIA, S.L.U. KY01 4600292091
1016379 DRISCOLL´S OF EUROPE B.V. KY01 4600322589
1008809 LANDGARD NORD OBST & GEMÜSE GM KY01 4600347315
1008835 C.A.S.I. : COOPERATIVA PROVINC KY01 4600348112
1019258 SYDGRÖNT EKONOMISK FÖRENING KY02 4600343422
(The second column of the above, between 7 digit number and KY0 above)
In real life the columns are not always in the same order since it's a user preference.
I just took a few examples but names could also include /éèáà()´, pretty much anything (sadly).
I found another question here Concrete Javascript Regex for Accented Characters (Diacritics)
When I use the regex patterns in that thread, example: "\t([A-zÀ-ÿ0-9\s\.\,\_\-\'\&]+)\t" (I know some characters are still missing) to match between two tabs it becomes greedy and matches the whole line.
Is there any pattern that could match any character in a company name between tabs (or two spaces as the example above)?

Instead of returning a matched part, I matched everything and replaced it with the 1st capture group. Hope it helps.
Sub Test()
    Dim str As String: str = "1025164" & vbTab & "HERBEX IBERIA, S.L.U." & vbTab & "KY01" & vbTab & "4600292091"
    With CreateObject("vbscript.regexp")
        .Global = True
        .Pattern = "(?:^|\t)(?:\d+|KY\d+|([^\t]+))(?=\t|$)"
        Debug.Print .Replace(str, "$1")
    End With
End Sub
The pattern breaks down as follows:
(?:^|\t) - Match either start line anchor or a tab. Unfortunately the VBA-regex object does not support lookbehinds.
(?: - Open a non-capture group to start matching all parts you don't want to capture first:
\d+ - match 1+ digits;
| - Or:
KY\d+ - Match "KY" followed by 1+ digits;
| - Or:
([^\t]+) - nest a capture group to capture 1+ non-tabs.
) - Close non-capture group.
(?=\t|$) - Positive lookahead to assert captured text is followed by either a tab or end-line anchor.

I would take a different approach and use the Split function. The following code assumes that the columns are separated by tabs and that the company name is the first column that is neither numeric (digits only) nor starts with "KY".
Function getCompanyName(line As String) As String
    Const separator = vbTab ' Replace with " " if you need that.
    Dim tokens() As String, i As Integer
    tokens = Split(line, separator)
    For i = 0 To UBound(tokens)
        ' Skip the numeric columns and the KY0x column
        If Not IsNumeric(tokens(i)) And Left(tokens(i), 2) <> "KY" Then
            getCompanyName = tokens(i)
            Exit Function
        End If
    Next
End Function

Capturing the submatches after the second instance of a string variable

How would one capture two numeric submatches after the second instance of a string/number?
I have a # that changes from .txt file to .txt file. It is captured in a variable called "Total" which I declared as a string. The string contains numbers, in the format of 123, 456,789.23 or 123, 456.01. This number appears about 3 times within the .txt file, and I have written a RegEx pattern that is able to capture the first instance of this number and its submatches.
regex.Pattern = Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
The .txt file portion I am trying to capture may appear as
123,456,789.38
2.180
251.517
OR
123,456,789.38 2.180 251.517
I want to capture 2.180 and 251.517.
The first instance includes the words "Number of: " in front of it, and I tried to make the pattern avert from the ":" before it by writing:
regex.Pattern = "[^:\s]" & Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
It still picks up this first instance and the numbers after that first instance. The second instance does not have any defining words before it, just a blank line such as the one below:
123,456,789.38
2.180
251.517
Additional information:
Dim regex As Object: Set regex = CreateObject("vbscript.regexp")
regex.Pattern = Total & "\s*([\d+\.\d*])\s*([\d+\.\d*])\s*"
Dim MCS As Object
Set MCS = regex.Execute(myText)
Dim Total As String: Total = MCS(0).submatches(0)
Dim submatch1 As String: submatch1 = MCS(0).Submatches(1)
Dim submatch2 As String: submatch2 = MCS(0).Submatches(2)
where mytext is the contents of the .txt file entirely as a string.
There are also words and numbers between the different instances of the variable "Total", such as
Number of: 123,456,789.38
Text Here Text Here
Number Here
123,456,789.38
2.180
251.517
I am also not sure how much/many text/numbers there will be between the first two instances of 123,456,789.38, so I am trying to think of how to work this to be flexible.
When I mention second instance, I mean that the number 123,456,789.38 (which is the variable named "Total") appears three times in the document. I want to capture the two submatches that appear after that number. However, since there are three times it appears, I want to capture the two submatches that appear after the second time that 123,456,789.38 pops up.
Link to the text file.
https://regex101.com/r/4hPtY3/6
Output:
submatch1 = 2.180
submatch2 = 251.517
Currently, it is capturing
submatch1 = 97
submatch2 = 5
with the pattern:
regex.Pattern = Replace(Total, ".", "\.") & "\s*([.\d]+)\s*([.\d]+)"

You can use
regex.Pattern = "^(?:[\s\S]*?" & Replace(Total, ".", "\.") & "){2}\D+([.\d]+)\s*([.\d]+)"
The regex demo shows the two groups capturing the values as expected.
Details
^ - start of string
(?:[\s\S]*?123,456\.78){2} - two occurrences of any 0+ chars as few as possible and then 123,456.78 string
\D+ - 1 or more non-digit chars
([.\d]+) - Group 1: one or more dots or digits
\s* - 0+ whitespaces
([.\d]+) - Group 2: one or more dots or digits.
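To show how the pieces fit together in VBA, here is a minimal sketch that wraps the pattern above in a function. The function name NumbersAfterSecondTotal and the "|" join are assumptions for illustration only. Note that SubMatches is zero-based, so the two groups are read at indexes 0 and 1:

```vba
Option Explicit

' Hypothetical wrapper: returns the two numbers that follow the second
' occurrence of total in myText, joined with "|", or "" when nothing matches.
Function NumbersAfterSecondTotal(ByVal myText As String, _
                                 ByVal total As String) As String
    Dim regex As Object, matches As Object
    Set regex = CreateObject("vbscript.regexp")
    ' Escape the dots in total so they only match literal dots
    regex.Pattern = "^(?:[\s\S]*?" & Replace(total, ".", "\.") & _
                    "){2}\D+([.\d]+)\s*([.\d]+)"
    Set matches = regex.Execute(myText)
    If matches.Count > 0 Then
        ' Group 1 is SubMatches(0), group 2 is SubMatches(1)
        NumbersAfterSecondTotal = matches(0).SubMatches(0) & "|" & _
                                  matches(0).SubMatches(1)
    End If
End Function
```

With the sample layout from the question (the total after "Number of: ", some text lines, then the total again followed by 2.180 and 251.517), this should return "2.180|251.517".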

regex for Excel to remove all but specific symbols after a specific symbol?

I have strings like this which are addresses, e.g.:
P.O. Box 422, E-commerce park<br>Vredenberg<br><br><br>Curaçao
Adelgatan 21<br>Malmö<br><br>211 22<br>Sweden
Läntinen Pitkäkatu 35 A 15<br>Turku<br><br>20100<br>Finland
I am interested in Country only. Country always comes last after a <br> tag.
Note, that there can be several such tags preceding this last value (e.g. 1st example string).
Is there a good way to do this with a formula, maybe along these lines:
Identify end of string
Loop a character back until one reaches ">" character
Cut everything else (including the ">" encountered)

You don't need RegEx to do this if it's always the last part of the string.
You can get it with the built-in string functions, like this:
Sub Test()
    Dim str As String, str1 As String, str2 As String
    Dim Countries As String
    str = "P.O. Box 422, E-commerce park<br>Vredenberg<br><br><br>Curaçao"
    str1 = "Adelgatan 21<br>Malmö<br><br>211 22<br>Sweden"
    str2 = "Läntinen Pitkäkatu 35 A 15<br>Turku<br><br>20100<br>Finland"
    ' InStrRev finds the last "<br>"; the extra 3 skips the rest of the tag
    Countries = Right(str, Len(str) - InStrRev(str, "<br>") - 3)
    Countries = Countries & vbNewLine & Right(str1, Len(str1) - InStrRev(str1, "<br>") - 3)
    Countries = Countries & vbNewLine & Right(str2, Len(str2) - InStrRev(str2, "<br>") - 3)
    MsgBox Countries
End Sub
Obviously this will need to be updated for how your data set is stored. You can loop through the dataset and apply the same string functions to each line.
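For looping over a dataset, the same InStrRev idea could be packaged as a small function. This is only a sketch. CountryFromAddress and BR_TAG are illustrative names, and it assumes the tag is always written exactly as lowercase <br>:

```vba
Option Explicit

' Hypothetical helper: returns whatever follows the last "<br>" in address,
' or the whole string when no "<br>" is present.
Function CountryFromAddress(ByVal address As String) As String
    Const BR_TAG As String = "<br>"
    Dim pos As Long
    pos = InStrRev(address, BR_TAG)   ' Start position of the last tag, 0 if none
    If pos = 0 Then
        CountryFromAddress = address
    Else
        ' Skip past the tag itself and take the rest of the string
        CountryFromAddress = Mid$(address, pos + Len(BR_TAG))
    End If
End Function
```

Placed in a standard module, it can also be called from a cell, for example =CountryFromAddress(A1).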

A formula works too. If the string is in A1, write in B1:
=TRIM(RIGHT(SUBSTITUTE(A1,"<br>",REPT(" ",100)),100))
Modified using an approach taken from here:
https://exceljet.net/formula/get-last-word

Split, escaping certain splits

I have a cell that contains multiple questions and answers and is organised like a CSV. So to get all these questions and answers separated a simple split using the comma as the delimiter should separate this easily.
Unfortunately, there are some values that use the comma as the decimal separator. Is there a way to escape the split for those occurrences?
Fortunately, my data can be split using ", " as separator, but if this wouldn't be the case, would there still be a solution besides manually replacing the decimal delimiter from a comma to a dot?
Example:
"Price: 0,09,Quantity: 12,Sold: Yes"
Using Split("Price: 0,09,Quantity: 12,Sold: Yes",",") would yield:
Price: 0
09
Quantity: 12
Sold: Yes

One possibility, given this test data, is to loop through the array after splitting, and whenever there's no : in the string, add this entry to the previous one.
The function that does this might look like this:
Public Function CleanUpSeparator(celldata As String) As String()
    Dim ret() As String
    Dim tmp() As String
    Dim i As Long, j As Long
    tmp = Split(celldata, ",")
    If UBound(tmp) < 0 Then
        ' Empty input: return the empty array as-is
        CleanUpSeparator = tmp
        Exit Function
    End If
    ReDim ret(UBound(tmp))
    j = -1 ' Index of the last entry written to ret
    For i = 0 To UBound(tmp)
        If InStr(1, tmp(i), ":") < 1 And j >= 0 Then
            ' No colon: append this piece to the previous entry, and restore the comma
            ret(j) = ret(j) & "," & tmp(i)
        Else
            j = j + 1
            ret(j) = tmp(i)
        End If
    Next i
    ReDim Preserve ret(j)
    CleanUpSeparator = ret
End Function
Note that there's room for improvement by making the separator characters : and , into parameters, for instance.

I spent the last 24 hours or so puzzling over what I THINK is a completely analogous problem, so I'll share my solution here. Forgive me if I'm wrong about the applicability of my solution to this question. :-)
My Problem: I have a SharePoint list in which teachers (I'm an elementary school technology specialist) enter end-of-year award certificates for me to print. Teachers can enter multiple students' names for a given award, separating each name using a comma. I have a VBA macro in Access that turns each name into a separate record for mail merging. Okay, I lied. That was more of a story. HERE'S the problem: How can teachers add a student name like Hank Williams, Jr. (note the comma) without having the comma cause "Jr." to be interpreted as a separate student in my macro?
The full contents of the (SharePoint exported to Excel) field "Students" are stored within the macro in a variable called strStudentsBeforeSplit, and this string is eventually split with this statement:
strStudents = Split(strStudentsBeforeSplit, ",", -1, vbTextCompare)
So there's the problem, really. The Split function is using a comma as a separator, but poor student Hank Williams, Jr. has a comma in his name. What to do?
I spent a long time trying to figure out how to escape the comma. If this is possible, I never figured it out.
Lots of forum posts suggested using a different character as the separator. That's okay, I guess, but here's the solution I came up with:
1. Replace only the special commas preceding "Jr" with a different, uncommon character BEFORE the Split function runs.
2. Swap back to the commas after Split runs.
That's really the end of my post, but here are the lines from my macro that accomplish step 1. This may or may not be of interest because it really just deals with the minutiae of making the swap. Note that the code handles several different (mostly wrong) ways my teachers might type the "Jr" part of the name.
'Dealing with the comma before Jr. This will handle ", Jr." and ", Jr" and " Jr." and " Jr".
'Replaces the comma with ~ because commas are used to separate fields in Split function below.
'Will swap ~ back to comma later in UpdateQ_Comma_for_Jr query.
strStudentsBeforeSplit = Replace(strStudentsBeforeSplit, "Jr", "~ Jr.") 'Every Jr gets this treatment regardless of what else is around it.
'Note that because of previous Replace functions a few lines prior, the space between the comma and Jr will have been removed. This adds it back.
strStudentsBeforeSplit = Replace(strStudentsBeforeSplit, ",~ Jr", "~ Jr") 'If teacher had added a comma, strip it.
strStudentsBeforeSplit = Replace(strStudentsBeforeSplit, " ~ Jr", "~ Jr") 'In cases when teacher added Jr but no comma, remove the (now extra)...
'...space that was before Jr.
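For anyone who wants both steps in one place, here is a stripped-down sketch of the protect, split, restore round trip. It is not the macro above: SplitNames and PLACEHOLDER are hypothetical names, it only handles the ", Jr" spelling, and it assumes ~ never appears in a real name.

```vba
Option Explicit

' Hypothetical helper: splits a comma-separated list of names while keeping
' the comma in ", Jr" attached to the name it belongs to.
Function SplitNames(ByVal names As String) As String()
    Const PLACEHOLDER As String = "~"   ' Assumed never to occur in a name
    Dim parts() As String
    Dim i As Long

    ' Step 1: hide the comma before Jr so Split does not see it
    names = Replace(names, ", Jr", PLACEHOLDER & " Jr")
    parts = Split(names, ",")

    ' Step 2: put the comma back and tidy surrounding spaces
    For i = LBound(parts) To UBound(parts)
        parts(i) = Trim$(Replace(parts(i), PLACEHOLDER, ","))
    Next i
    SplitNames = parts
End Function
```

For example, SplitNames("Hank Williams, Jr., Patsy Cline") yields two entries, "Hank Williams, Jr." and "Patsy Cline".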

Lookup customer type by the meaningful part of the customer name and set prioritize

Is there any way Excel 2010 can look up a customer type using the meaningful part of a customer name?
For example, the customer name is Littleton's Valley Market, but in the list where I am trying to look up the customer type, the names are formatted a little differently, such as <Littletons Valley MKT #2807 and/or Littleton Valley.
Some customers can be listed under multiple customer types. How can Excel tell me which customers those are, and can I set Excel to pull the primary or secondary type?

Re #1: The formula below fails on the leading < (if it belongs there!) and on any other extraneous prefix, but such cases may be rare or non-existent, so:
=INDEX(G:G,MATCH(LEFT(A1,6)&"*",F:F,0))
or similar may catch enough to be useful. This looks at the first six characters but can be adjusted to suit, though unfortunately only once at a time. Assumes the mismatches are in ColumnA (eg A1 for the formula above) and that the correct names are in ColumnF with the required type in the corresponding row of ColumnG.
On a large scale Fuzzy Lookup may be helpful.
Since the question carries a VBA tag, Soundex matching and Levenshtein distance may also be of interest.
Re #2 If secondary type is in ColumnH, again in matching row, then adjust G:G above to H:H.
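As a rough illustration of the Levenshtein idea, here is a sketch of the textbook dynamic-programming version in VBA. The function name is my own, and the comparison is case-sensitive under the default Option Compare Binary, so wrap the arguments in LCase if case should be ignored:

```vba
Option Explicit

' Sketch of the classic edit-distance calculation: the number of single-character
' insertions, deletions and substitutions needed to turn a into b.
Function Levenshtein(ByVal a As String, ByVal b As String) As Long
    Dim d() As Long
    Dim i As Long, j As Long, cost As Long

    ReDim d(0 To Len(a), 0 To Len(b))
    For i = 0 To Len(a)
        d(i, 0) = i                 ' Delete i characters to reach ""
    Next i
    For j = 0 To Len(b)
        d(0, j) = j                 ' Insert j characters starting from ""
    Next j

    For i = 1 To Len(a)
        For j = 1 To Len(b)
            cost = IIf(Mid$(a, i, 1) = Mid$(b, j, 1), 0, 1)
            d(i, j) = d(i - 1, j) + 1                                   ' Deletion
            If d(i, j - 1) + 1 < d(i, j) Then d(i, j) = d(i, j - 1) + 1 ' Insertion
            If d(i - 1, j - 1) + cost < d(i, j) Then d(i, j) = d(i - 1, j - 1) + cost ' Substitution
        Next j
    Next i
    Levenshtein = d(Len(a), Len(b))
End Function
```

A small distance suggests a likely match: Levenshtein("Littletons Valley", "Littleton Valley") is 1. What counts as "small enough" is a judgment call that depends on the data.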

pnuts gives a good answer re: Fuzzy Lookup, Soundex matching, etc. Quick and dirty way I've handled this before:
Function isNameLike(nameSearch As String, nameMatch As String) As Boolean
    On Error GoTo ErrorHandler
    If InStr(1, invalidChars(nameSearch), invalidChars(nameMatch), vbTextCompare) > 0 Then isNameLike = True
    Exit Function
ErrorHandler:
    isNameLike = False
End Function
Function invalidChars(strIn As String) As String
    Dim i As Long
    Dim sIn As String
    Dim sOut As String
    sOut = ""
    On Error GoTo ErrorHandler
    For i = 1 To Len(strIn)
        sIn = Mid(strIn, i, 1)
        If InStr(1, " 1234567890~`!##$%^&*()_-+={}|[]\:'<>?,./" & Chr(34), sIn, vbTextCompare) = 0 Then sOut = sOut & sIn
    Next i
    invalidChars = sOut
    Exit Function
ErrorHandler:
    invalidChars = strIn
End Function
Then I can call isNameLike from code, or use it as a formula in a worksheet. Note that you still have to supply the "significant" part of the customer name you're looking for.
