I'm looking for a macro (preferably a function) that would take cell contents, split it into separate words, compare them to one another and remove the shorter words.
Here's an image of what I want the output to look like (I need the words that are crossed out removed):
I tried to write a macro myself, but it doesn't work 100% properly because it's not taking the last words and sometimes removes what shouldn't be removed. Also, I have to do this on around 50k cells, so a macro takes a lot of time to run, that's why I'd prefer it to be a function. I guess I shouldn't use the replace function, but I couldn't make anything else work.
Sub clean_words_containing_eachother()
Dim sht1 As Worksheet
Dim LastRow As Long
Dim Cell As Range
Dim cell_value As String
Dim word, word2 As Variant
Set sht1 = ActiveSheet
col = InputBox("Which column do you want to clear?")
LastRow = sht1.Cells(sht1.Rows.Count, col).End(xlUp).Row
Let to_clean = col & "2:" & col & LastRow
For i = 2 To LastRow
For Each Cell In sht1.Range(to_clean)
cell_value = Cell.Value
cell_split = Split(cell_value, " ")
For Each word In cell_split
For Each word2 In cell_split
If word <> word2 Then
If InStr(word2, word) > 0 Then
If Len(word) < Len(word2) Then
word = word & " "
Cell = Replace(Cell, word, " ")
ElseIf Len(word) > Len(word2) Then
word2 = word2 & " "
Cell = Replace(Cell, word2, " ")
End If
End If
End If
Next word2
Next word
Next Cell
Next i
End Sub
Assuming that the retention of the third word in your first example is an error, since books is contained later on in notebooks:
5003886 book books bound case casebound not notebook notebooks office oxford sign signature
and also assuming that you would want to remove duplicate identical words, even if they are not contained subsequently in another word, then we can use a Regular Expression.
The regex will:
Capture each word
look-ahead to see if that word exists later on in the string
if it does, remove it
Since VBA regexes do not support look-behind, we work around this limitation by running the regex a second time on the reversed string.
Then remove the extra spaces and we are done.
Option Explicit
Function cleanWords(S As String) As String
Dim RE As Object, MC As Object, M As Object
Dim sTemp As String
Set RE = CreateObject("vbscript.regexp")
With RE
.Global = True
.Pattern = "\b(\w+)\b(?=.*\1)"
.ignorecase = True
'replace looking forward
sTemp = .Replace(S, "")
' check in reverse
sTemp = .Replace(StrReverse(sTemp), "")
'return to normal
sTemp = StrReverse(sTemp)
'Remove extraneous spaces
cleanWords = WorksheetFunction.Trim(sTemp)
End With
End Function
Limitations
punctuation will not be removed
a "word" is defined as containing only the characters in the class [_A-Za-z0-9] (letters, digits and the underscore).
if any words are hyphenated, or contain other non-word characters, the code above treats them as two separate words; if you want them treated as a single word, the regex will need to change
General steps:
Write cell to array (already working)
for each element (x), go through each element (y) (already working)
if x is in y AND y is longer than x THEN set x to ""
concat array back into string
write string to cell
String/array manipulations are much faster than operations on cells, so this will give you some increase in performance (depending on the amount of words you need to replace for each cell).
The "last word problem" might be that you don't have a space after the last word within your cells, since you only replace word + " " with " ".
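The general steps above can be sketched as a worksheet function. This is only a minimal sketch: the name RemoveContainedWords and its variable names are made up for illustration, it assumes words are separated by single spaces, and it compares case-insensitively. Unlike the regex answer, it keeps exact duplicates, because a word is dropped only when a strictly longer word contains it.

```vba
Option Explicit

' Hypothetical UDF: drops every word that is contained in a longer word
' of the same cell. All work is done on an in-memory array, not on cells.
Public Function RemoveContainedWords(ByVal cellText As String) As String
    Dim words() As String
    Dim keep() As Boolean
    Dim i As Long, j As Long
    Dim result As String

    If Len(Trim$(cellText)) = 0 Then Exit Function

    words = Split(Trim$(cellText), " ")
    ReDim keep(LBound(words) To UBound(words))

    For i = LBound(words) To UBound(words)
        ' Empty items (from repeated spaces) are never kept.
        keep(i) = (Len(words(i)) > 0)
        For j = LBound(words) To UBound(words)
            If keep(i) And i <> j Then
                ' Compare against the original words, so removal order does not matter.
                If Len(words(j)) > Len(words(i)) And _
                   InStr(1, words(j), words(i), vbTextCompare) > 0 Then
                    keep(i) = False
                End If
            End If
        Next j
    Next i

    ' Concatenate the surviving words back into one string.
    For i = LBound(words) To UBound(words)
        If keep(i) Then result = result & " " & words(i)
    Next i

    RemoveContainedWords = Mid$(result, 2)   ' strip the leading space
End Function
```

It can be called from a cell as =RemoveContainedWords(A2). Because the last word is handled like any other array element, it does not depend on a trailing space.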
Related
I need help with my code that displays an input box and the user inputs a name then the code splits the names and counts the names displaying the following:
Sub ParseName()
Dim Name As String
Dim Count As Integer
Dim Cell As Object
Dim n As Integer
Count = 0
Name = InputBox("Enter First Name, Middle Name, and Last Name")
If Name = "" Then
For Each Cell In Selection
n = InStr(1, Cell.Value, Name)
While n <> 0
Count = Count + 1
n = InStr(n + 1, Cell.Value, Name)
Next Cell
MsgBox Count & " Occurrences of " & Name
End If
End Sub
Absolutely. You could split the names, but it probably wouldn't help. VBA's Split doesn't let you split a string into single characters, AFAIK, as other languages might when you split on an empty delimiter. So you could just loop through the characters using MID to see whether each one is a space or not.
There is a way without any splitting or looping. You can just replace the spaces and get the length of what's left.
Len(Replace(Name, " ", ""))
where REPLACE just replaces one string with another, in this case replacing all the spaces with nothing, and LEN just counts the characters in a string.
Here's your code rewritten to use this method, with the unnecessary code and variables removed. I would also change the Name variable, since I believe that is a reserved word in VBA. It will let you do it, but you're potentially impacting some existing behavior. For this particular purpose, using a standalone function to get the character count is someh
Sub ParseName()
Dim fullName As String, charCount As Integer
fullName = InputBox("Enter First Name, Middle Name, and Last Name")
If fullName <> "" Then
charCount = Len(Replace(fullName, " ", ""))
MsgBox fullName & " has " & charCount & " characters"
End If
End Sub
Bear in mind, however, that there are plenty of other character codes you might not want to count. Tabs, new lines, any of a number of whitespace characters. Non-character symbols. Things of that nature.
Also, this code does not check that the string even contains letters, or that the user input has three names, or that it is in the format First Middle Last.
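One of the missing checks, whether the input holds three names, could be sketched as follows. The function name HasThreeNames is hypothetical, and the sketch only counts space-separated parts; it does not verify that the parts are letters or that they are in First Middle Last order.

```vba
Option Explicit

' Hypothetical helper: True when the input holds exactly three
' space-separated parts.
Public Function HasThreeNames(ByVal fullName As String) As Boolean
    Dim parts() As String
    Dim i As Long, n As Long

    parts = Split(Trim$(fullName), " ")
    For i = LBound(parts) To UBound(parts)
        If Len(parts(i)) > 0 Then n = n + 1   ' ignore empties from repeated spaces
    Next i

    HasThreeNames = (n = 3)
End Function
```

In ParseName it could guard the message box, for example with If HasThreeNames(fullName) Then.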
I have a sequence of texts that represent customers names, but they do not have spaces between each word/name (for example: JohnWilliamsSmith). However the words can be differentiated because the first letter of each word is uppercase.
So I need to transpose this list of customers strings to their regular format, with spaces between each word. So I want JohnWilliamsSmith to become John Williams Smith. However, I couldn't think of an immediate way to achieve this, and I believe that no combination of Excel formulas can offer this result.
Thus, I think the only solution would be to set up a Macro. It might be a Function to be used as a formula, or a code in module to work the data in a certain range (imagine that the list is in Range ("A2: A100")).
Does anyone have any idea how I can do this?
Function AddSpaces(PValue As String) As String
    Dim xOut As String
    Dim xAsc As Integer
    Dim i As Long
    ' Keep the first character as-is, then scan the rest.
    xOut = VBA.Left(PValue, 1)
    For i = 2 To VBA.Len(PValue)
        xAsc = VBA.Asc(VBA.Mid(PValue, i, 1))
        ' ASCII 65 to 90 are the uppercase letters A to Z.
        If xAsc >= 65 And xAsc <= 90 Then
            xOut = xOut & " " & VBA.Mid(PValue, i, 1)
        Else
            xOut = xOut & VBA.Mid(PValue, i, 1)
        End If
    Next
    AddSpaces = xOut
End Function
NB: To use this function, enter the formula =AddSpaces(A1).
In addition to #Forty3's comment to your question, the answer on how to use Regular Expressions in VBA is here.
With that being said you are then looking for the regular expression to match John, Williams, Smith which is ([A-Z])([a-z]+.*?)
Dim regex As New RegExp
Dim matches As MatchCollection
Dim match As match
Dim name As String
regex.Global = True
regex.Pattern = "([A-Z])([a-z]+.*?)"
name = "JohnWilliamsSmith"
Set matches = regex.Execute(name)
For Each match In matches
name = Replace(name, match.Value, match.Value + " ")
Next match
name = Trim(name)
This gives me John Williams Smith. Of course, additional coding will be needed to account for cases like WillWilliamsWilliamson.
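One way to sidestep the WillWilliamsWilliamson problem is to let the regex do the replacement itself instead of calling Replace on each match. The following is a minimal sketch, not the original answer's method: the function name SplitCamelCase is made up, and it assumes a new word starts wherever a lowercase letter is followed directly by an uppercase letter.

```vba
Option Explicit

' Hypothetical helper: inserts a space wherever a lowercase letter is
' followed directly by an uppercase letter.
Public Function SplitCamelCase(ByVal s As String) As String
    Dim re As Object
    ' Late binding, so no reference to the RegExp library is required.
    Set re = CreateObject("VBScript.RegExp")
    With re
        .Global = True
        .IgnoreCase = False
        .Pattern = "([a-z])([A-Z])"
        ' $1 and $2 refer to the two captured letters.
        SplitCamelCase = .Replace(s, "$1 $2")
    End With
End Function
```

Because each boundary is replaced where it occurs, repeated substrings do not interfere with one another. Names with internal capitals, such as McDonald, would still be split and need separate handling.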
Let's say I have "Vegas is great" in cell A1. I want to write a formula that looks for the exact word, "gas" in cells. Vegas ≠ gas, but the only search formula I'm finding:
=ISNUMBER(SEARCH("gas",LOWER(A1)))
returns true. Is there any way to do exact matching? I'd ideally like it to be case-insensitive, which I believe is satisfied by wrapping A1 in LOWER().
I believe that to cover all cases correctly you have to pad spaces before and after both the term "gas" and the text being searched. This ensures that gas is found at the beginning or end of a cell, and prevents it from being found in the middle of a word. Your post does not say whether punctuation can occur in the data. Padding with spaces alone will not handle punctuation; you would have to add a separate search for each case, such as " gas. " or " gas! ". If you are worried about catching values like "gas.cost" or similar, you can use the same padding around the punctuation search.
=Or(ISNUMBER(SEARCH(" gas ", " "&A1&" ")),ISNUMBER(SEARCH(" gas. ", " "&A1&" ")))
This is a basic search that finds the word gas by itself, or "gas." followed by a space. Because the cell text is padded with a trailing space, "gas." is also found as the final word in a sentence or at the end of a cell.
Edit: Dropped a parentheses.
The Find function is case sensitive. The SEARCH function is not. There is no need for the LOWER function if you are using SEARCH.
SEARCH(<find_text>, <within_text>, [optional]<start_num>)
Wrap both the find_text and within_text in spaces and perform your SEARCH.
The formula in B1 is,
=ISNUMBER(SEARCH(" gas ", " "&A1&" "))
Fill down as necessary.
One can also use regular expressions in VBA to accomplish this. In Regular Expressions, "\b" represents a word boundary. A word boundary is defined as the position between a word and a non-word character or the beginning or end of the line. Word characters are [A-Za-z0-9_] (letters, digits, and the underscore). Hence, one can use this UDF. You do need to be aware that words which include non-word characters (e.g. a hyphen) may be treated differently than you expect. And if you are dealing with non-English letters, the Pattern would need to be modified.
But the code is fairly compact.
Option Explicit
Function reFindWord(FindWord As String, SearchText As String, Optional MatchCase As Boolean = False) As Boolean
Dim RE As Object
Dim sPattern As String
Set RE = CreateObject("vbscript.regexp")
sPattern = "\b" & FindWord & "\b"
With RE
.Pattern = sPattern
.ignorecase = Not MatchCase
reFindWord = .test(SearchText)
End With
End Function
I think the only way to cover all possible punctuation surrounding the search word is to create a custom macro function. Use the enhanced split function to tokenize the sentence into an array of words then search the array for a match.
Enhanced split function
https://msdn.microsoft.com/en-us/library/aa155763
How to create custom macro
http://www.wikihow.com/Create-a-User-Defined-Function-in-Microsoft-Excel
Code to create FindEngWord function
Public Function FindEngWord(ByVal TextToSearch As String, ByVal WordToFind As String) As Boolean
    Dim WrdArray() As String
    Dim isFound As Boolean
    Dim i As Long
    isFound = False
    ' Split here is the enhanced Split function defined below, not VBA.Split.
    WrdArray() = Split(TextToSearch)
    For i = 0 To UBound(WrdArray)
        If LCase(WrdArray(i)) = LCase(WordToFind) Then
            isFound = True
            Exit For ' stop at the first match
        End If
    Next i
    FindEngWord = isFound
End Function
Public Function Split(ByVal InputText As String, _
Optional ByVal Delimiter As String) As Variant
' This function splits the sentence in InputText into
' words and returns a string array of the words. Each
' element of the array contains one word.
' This constant contains punctuation and characters
' that should be filtered from the input string.
Const CHARS = ".!?,;:""'()[]{}"
Dim strReplacedText As String
Dim intIndex As Integer
' Replace tab characters with space characters.
strReplacedText = Trim(Replace(InputText, _
vbTab, " "))
' Filter all specified characters from the string.
For intIndex = 1 To Len(CHARS)
strReplacedText = Trim(Replace(strReplacedText, _
Mid(CHARS, intIndex, 1), " "))
Next intIndex
' Loop until all consecutive space characters are
' replaced by a single space character.
' Space$(2) is a run of two spaces.
Do While InStr(strReplacedText, Space$(2)) > 0
strReplacedText = Replace(strReplacedText, _
Space$(2), " ")
Loop
' Split the sentence into an array of words and return
' the array. If a delimiter is specified, use it.
'MsgBox "String:" & strReplacedText
If Len(Delimiter) = 0 Then
Split = VBA.Split(strReplacedText)
Else
Split = VBA.Split(strReplacedText, Delimiter)
End If
End Function
Can be called from your excel sheet with this.
=FindEngWord(A1,"gas")
I think this will handle all the cases that you are planning to handle:
=OR(ISNUMBER(SEARCH(" gas",LOWER(A1), 1 )), LEFT(A1,3)= "gas")
I added a space before "gas" in the search. If gas is the only word or the first word in the cell, the second part of the formula (the LEFT test) handles that case. Note that this still matches longer words that begin with "gas", such as "gasoline".
I have a column containing multiple string values, like a sentence.
In that sentence I want to find one or all alphanumeric values of 10 or more characters containing at least one hyphen (-), and put the resulting values in another column.
For example:
the column containing sentence is like:
upgrade 15.07.2010, old No: WI82-01062. User moved to No: WI12-01012 02.04.2012 to a 2 user network.
or
Upgrade from lite 7/6/07, old No: PTX7-89C367EC5052-01211
Ideally I want a column with values like WI82-01062, WI12-01012 for the first example, and PTX7-89C367EC5052-01211 for the second example.
Maybe searching for the - in the string and then finding the first blank space on either side would help, but I do not have any clue how to write that in Excel terms.
Thanks
You could probably use a regex like this (there may be better patterns!):
Function ExtractData(r As Variant) As String
Static oRE As Object
Dim sTemp As String
Dim n As Long
Dim matches
If oRE Is Nothing Then
Set oRE = CreateObject("vbscript.regexp")
With oRE
.Pattern = "[A-Za-z0-9\-]{10,}"
.Global = True
End With
End If
Set matches = oRE.Execute(r)
If matches.Count > 0 Then
For n = 1 To matches.Count
sTemp = sTemp & ", " & matches(n - 1)
Next n
ExtractData = Mid$(sTemp, 3)
End If
End Function
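The pattern above also matches any plain run of 10 or more letters or digits, with or without a hyphen. A possible refinement, sketched here under the assumption that only matches containing at least one hyphen are wanted, is to filter the matches with InStr. The function name ExtractHyphenated is hypothetical.

```vba
Option Explicit

' Hypothetical variant: keeps only matches of 10 or more characters
' that contain at least one hyphen.
Public Function ExtractHyphenated(ByVal s As String) As String
    Dim re As Object, m As Object
    Dim found As String

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = "[A-Za-z0-9\-]{10,}"

    For Each m In re.Execute(s)
        ' Skip long runs of letters or digits that have no hyphen.
        If InStr(m.Value, "-") > 0 Then
            found = found & ", " & m.Value
        End If
    Next m

    ExtractHyphenated = Mid$(found, 3)   ' strip the leading ", "
End Function
```

Used as =ExtractHyphenated(A1), it returns the matching values as a comma-separated list, or an empty string when nothing qualifies.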
I've got some items of text in Excel and I'd like to capitalise the first letter of each word. However, a lot of the text contains the phrase 'IT', and the current capitalisation method (PROPER) changes this to 'It'. Is there a way to capitalise only the first letter of each word without de-capitalising the other letters in each word?
Here is a VBA way: add it to a module and use =PrefixCaps(A1)
Public Function PrefixCaps(value As String) As String
Dim Words() As String: Words = Split(value, " ")
Dim i As Long
For i = 0 To UBound(Words)
Mid$(Words(i), 1, 1) = UCase$(Mid$(Words(i), 1, 1))
Next
PrefixCaps = Join(Words, " ")
End Function
Used the website http://www.textfixer.com/tools/capitalize-sentences.php and pasted it all in instead
That was all a bit complicated, but I did find that if your spreadsheet is pretty simple, you can copy and paste it into Word, use its editing features, and then copy and paste the result back into Excel. Worked quite well for me.
Fixes double spaces in the text:
Public Function PrefixCaps(value As String) As String
Dim Words() As String: Words = Split(value, " ")
Dim i As Long
For i = 0 To UBound(Words)
If Len(Words(i)) > 0 Then
Mid$(Words(i), 1, 1) = UCase$(Mid$(Words(i), 1, 1))
End If
Next
PrefixCaps = Join(Words, " ")
End Function