Regular Expression to remove between characters (Excel VBA)

I have some text I need to remove from a string, but I cannot use the normal Replace() because it is a timestamp that changes every time.
Text to remove: <09:35:40> (it could be any time, but always in the same format, <HH:MM:SS>).
These timestamps can occur in multiple places throughout a string, and all of them need to be removed (replaced with "").
I've seen regular expressions used for similar problems in other posts, but I don't really understand them, so I cannot tell which one fits my use case.
Edit:
The < and > also need to be removed.
If feedback could be provided as to the -1, that would be great. Help me improve.

I don't think you need regular expressions. What about:
Range("A:A").Replace "<??:??:??>", "", xlPart
Use Application.Trim() to deal with double spaces after the replacement.
Range("A:A") is just my placeholder for whatever your Range object is.

You could use Split:
Text = "Text to remove <09:35:40> (could be any time, but always the same format)"
NewText = Split(Text, "<")(0) + Split(Text, "> ")(1)
? NewText
Text to remove (could be any time, but always the same format)

Use this regular expression to match every substring in HH:MM:SS format, then replace each match with an empty string (""):
\d\d:\d\d:\d\d
To match the surrounding < and > characters as well, use this one:
\<\d\d:\d\d:\d\d\>

In Excel, you can find and replace between two characters using the normal Ctrl+F replace and searching for <*> (in my use case). However, wildcard characters such as * cannot be used in the Replace() function within VBA code. If you want to perform the same operation, replacing anything between two characters, I believe a regular expression is a good way of achieving this. The following code works for me in Excel VBA. Note that I am working on the string before it hits the spreadsheet (i.e. I am formatting the string before I print it to any cells).
Dim regExp As Object
Set regExp = CreateObject("vbscript.regexp") 'Late binding, so you do not have to enable VBScript Regular Expressions 5.5 in the references.
With regExp
    .Global = True 'Get all matches.
    .Pattern = "\<\d\d:\d\d:\d\d\>" 'Match < followed by HH:MM:SS followed by >. As per the guide Jayadul Shuvo links.
    newString = .Replace(prevString, "") 'Replace every match of the pattern with an empty string "".
End With
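For readers who want to check the pattern outside Excel, here is a minimal Python sketch of the same replacement. It assumes Python's re module handles this particular pattern the same way as the VBScript RegExp object; the name remove_timestamps and the sample string are made up for illustration.

```python
import re

# Equivalent to the VBA pattern: a literal <, HH:MM:SS written as digits, a literal >.
TIMESTAMP = re.compile(r"<\d\d:\d\d:\d\d>")

def remove_timestamps(text: str) -> str:
    """Remove every <HH:MM:SS> timestamp; sub() replaces all matches, like .Global = True."""
    return TIMESTAMP.sub("", text)

# Hypothetical sample with two timestamps in one string.
sample = "start <09:35:40>middle <23:59:59>end"
print(remove_timestamps(sample))
```

Text that does not fit the two-digit format, such as <9:35:40>, is left untouched, which is usually what you want when the format is fixed.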

I had another requirement where "\start " and "\stop " appeared before and after the timestamps. I was not aware of this to begin with; new information came to light. E.g. "\start <HH:MM:SS> \stop ". This could also be spread across newlines, so I had to handle removing the newline as well.
This essentially meant I had to remove the string between two substrings (including any newline), and I used the following pattern:
"\\\bstart\b\s((.|\n)*?)\\\bstop\b\s"
'\\ matches a literal \ (the first \ removes the special meaning of the second)
'\b before and after a word matches it as a whole word
'\s matches a whitespace character (here, the space)
'(.|\n) matches any single character, including a newline
'*? matches zero or more occurrences, but as few as possible
I would recommend using a regular expression tester such as https://regexr.com/3hmb6 when creating these patterns; it is so helpful! Use the tabs on the bottom right to see what is replaced and to get an explanation of what is going on.
Picture Snippet of the explanation tab for ((.|\n)*?) on the tester website
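The start/stop pattern can also be tried in Python, as a rough sketch. It assumes Python's re module interprets the pattern the same way as VBScript does; remove_blocks and the sample string are hypothetical.

```python
import re

# The pattern from the answer, written as a Python raw string.
BLOCK = re.compile(r"\\\bstart\b\s((.|\n)*?)\\\bstop\b\s")

def remove_blocks(text: str) -> str:
    """Delete each span from the start marker through the stop marker, newlines included."""
    return BLOCK.sub("", text)

# "\\" in a normal Python string is a single backslash, so this sample reads:
# keep \start <09:35:40>(newline)\stop this
sample = "keep \\start <09:35:40>\n\\stop this"
print(remove_blocks(sample))
```

Because *? is lazy, two separate start/stop spans in one string are removed individually instead of being merged into one large match.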

Related

How can I substitute multiple occurrences of junk strings in Excel?

In the image, 'muddle' is the string containing junk words and the strings I want to extract. There is a fixed list of junk words - the good strings could be literally anything.
You can see this formula has correctly extracted "moo" and "coo", which are not in the list of junk words. The formula is below.
=LET(junkStart,FILTER(SEARCH(Table1[junkwords],Table2[muddle]),ISNUMBER(SEARCH(Table1[junkwords],Table2[muddle]))),
junkEnd,FILTER(SEARCH(Table1[junkwords],Table2[muddle])+LEN(Table1[junkwords])-1,ISNUMBER(SEARCH(Table1[junkwords],Table2[muddle])+LEN(Table1[junkwords])-1)),
goodstart,FILTER(junkEnd+1,(junkEnd+1<=LEN(Table2[muddle]))*(ISERROR(XMATCH(junkEnd+1,junkStart)))),
goodend,FILTER(junkStart-1,(junkStart-1>=LEN(1))*(ISERROR(XMATCH(junkStart-1,junkEnd))))+1,
goodchars,goodend-goodstart,
TEXTJOIN("; ",TRUE,MID(Table2[muddle],goodstart,goodchars)))
This works well, but it falls down if a junk word occurs more than once. See below.
The only difference is that 'woo' occurs twice in the second example.
I need a single cell solution. VBA is not an option for me. Using the name manager would be untidy, as would nested formulas.
I've got this far with formulas, which as far as I can tell is the furthest anyone has got with the 'removing multiple words from a cell' problem. I can see the issue - once SEARCH locates the start of a string in a cell, it doesn't go looking for a second occurrence of that string. But I don't know how to find the start of every instance of every string. Can anyone help?

REDUCE is perfect for this:
=REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(m,j,SUBSTITUTE(m,j,"")))
REDUCE starts with the Table2[muddle] value as m, then substitutes the first value of Table1[junkwords] (j) with "". The outcome becomes the new m, in which the second value of j is substituted, and so on until every junk word has been processed.
If you want the result comma-separated it becomes more complicated, but you can achieve it with:
=LET(t,SUBSTITUTE(","&REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(x,y,SUBSTITUTE(x,y,",")))&",",",,",","),
MID(t,2,LEN(t)-2))
This does almost the same as the previous solution, but instead of substituting each junk word with an empty string it substitutes a comma, and then replaces every double comma (,,) with a single one, so junk words that follow each other result in one comma. Also, if the first and/or last part of the string were substituted, the result would have a leading and/or trailing comma. This is handled by adding a comma to the front and back before replacing the double commas with singles. The result t is then wrapped in MID, which removes the first and last character (both being a comma).
Alternate solution:
=LET(t,REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(x,y,SUBSTITUTE(x,y," "))),
SUBSTITUTE(TRIM(t)," ",","))
Or in one go if you don't want to use LET:
=SUBSTITUTE(TRIM(REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(x,y,SUBSTITUTE(x,y," "))))," ",",")
This replaces the junk words with a space. Regardless of how many junk words sit between words, or how many leading or trailing spaces there are, TRIM reduces the result to the words separated by a single space. Substituting a comma for each space then gives your result.
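The space-then-TRIM approach can be sketched in Python with functools.reduce, which folds over a list in the same way REDUCE does. The function name strip_junk and the sample data are hypothetical, and str.replace stands in for SUBSTITUTE.

```python
from functools import reduce

def strip_junk(muddle: str, junkwords: list[str], sep: str = ",") -> str:
    # Fold over the junk words, as REDUCE does: each step replaces every
    # occurrence of one junk word with a space in the running result.
    spaced = reduce(lambda text, junk: text.replace(junk, " "), junkwords, muddle)
    # Dropping empty pieces mimics TRIM; joining with sep mimics the outer SUBSTITUTE.
    return sep.join(piece for piece in spaced.split(" ") if piece)

# Hypothetical data in which the junk word "woo" occurs twice.
print(strip_junk("boomoowoocoowoonoo", ["boo", "woo", "noo"]))
```

Every occurrence of a junk word is handled in one step, which is what the original SEARCH-based formula could not do.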

There's no single-formula solution if the junkwords list is not fixed.
Instead, you can apply the SUBSTITUTE() function in each cell of the "Extracted Strings" column to substitute all occurrences of each junk word in muddle in turn, i.e. substitute "boo" in muddle, then substitute "voo" in the resulting string, then "noo" in that result, and so on. The last substitution gives the final result.
One point to note, though: you need to make sure the junk words do not overlap or contain one another, or else define the order of processing, for the solution to be "complete". Consider the following:
junk words = abc, def, cde
muddle = 1234abcdef5678
If you process the junk words in the above order, you get "12345678".
If you process the junk words in reverse order, you get "1234abf5678".
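The order dependence is easy to reproduce. Here is a small Python sketch using the junk words and muddle from the example above; remove_in_order is a hypothetical helper and str.replace stands in for SUBSTITUTE.

```python
def remove_in_order(muddle: str, junkwords: list[str]) -> str:
    # Remove the junk words one at a time, in exactly the order given.
    for junk in junkwords:
        muddle = muddle.replace(junk, "")
    return muddle

junk = ["abc", "def", "cde"]
muddle = "1234abcdef5678"

forward = remove_in_order(muddle, junk)
backward = remove_in_order(muddle, list(reversed(junk)))
# The two results differ because "cde" overlaps both "abc" and "def".
print(forward, backward, forward == backward)
```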

How to make FIND function exact in Excel

I'm using the FIND function in Excel to check whether certain characters appear in a string of characters in a cell.
However, this function doesn't work cleanly for certain special characters, specifically F̌, B̌ and some others. When F̌ appears in the string, FIND recognizes it as both F and F̌.
Notably, this is not the case for characters such as Ď and Č. FIND works nicely for these.
How can I get the formula to always differentiate between characters with and without the hat? Is there a way to work in EXACT?
Thank you!

It is because F̌ is actually two characters.
=LEN("F̌") returns 2 not 1. The second character is the hat.
If you do:
=UNICHAR(70)&UNICHAR(780)
It will return the F̌
And as such =FIND("F","F̌") will return 1 as it is the first letter of a two character string.
To find "F" in A,B,F̌,F use:
=AGGREGATE(15,7,ROW($ZZ1:INDEX($ZZ:$ZZ,LEN(A1)))/((MID(A1,ROW($ZZ1:INDEX($ZZ:$ZZ,LEN(A1))),1)="F")*(MID(A1,ROW($ZZ2:INDEX($ZZ:$ZZ,LEN(A1)+1)),1)<>UNICHAR(780))),1)
To find either then we need to use IF:
=IF(LEN(A2)=2,FIND(A2,A1),AGGREGATE(15,7,ROW($ZZ$1:INDEX($ZZ:$ZZ,LEN(A1)))/((MID(A1,ROW($ZZ$1:INDEX($ZZ:$ZZ,LEN(A1))),1)=A2)*(MID(A1,ROW($ZZ$2:INDEX($ZZ:$ZZ,LEN(A1)+1)),1)<>UNICHAR(780))),1))
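The same idea can be checked in Python, as a sketch. It assumes that treating any Unicode combining mark (not only UNICHAR(780)) as "the hat" is acceptable, which goes slightly beyond the formula above; find_exact is a hypothetical helper.

```python
import unicodedata

def find_exact(needle: str, haystack: str) -> int:
    """Return the 1-based position of needle where it is not followed by a
    combining mark, or 0 if there is no such position."""
    start = haystack.find(needle)
    while start != -1:
        end = start + len(needle)
        # unicodedata.combining() is non-zero for marks such as U+030C (caron).
        if end == len(haystack) or unicodedata.combining(haystack[end]) == 0:
            return start + 1
        start = haystack.find(needle, start + 1)
    return 0

text = "A,B,F\u030c,F"  # the third item is F followed by a combining caron
print(len("F\u030c"))   # 2: base letter plus combining mark
print(text.find("F") + 1, find_exact("F", text))
```

A plain find stops at the F inside F̌, while find_exact skips it and reports the standalone F.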

Given that your substrings are comma-separated, look for the character followed by a comma (and add a comma to the end of the string to find the last character).
This allows you to separate multicharacter substrings from uni-character substrings where the latter is contained in the former.
You could use something like:
=FIND("F,",A5&",")
That will find an F in A5, but will not find an F if only F̌ is present

Excel Remove only last characters if they match

I've been trying a few different ways to search and replace in Excel to remove the last couple of characters.
For instance, in one column I have: product name S
I want to remove the " S" only.
I have tried some IF formulas as well and have not had much luck. I'm assuming there is a simple regex that can be used for the search and replace, e.g. " S/", that would only replace it if it is at the very end with nothing after it.

Try using the SUBSTITUTE function to replace the letters you want to remove with a unique character, word or space that does not appear anywhere else in the workbook, depending on which part of the string you're trying to remove and what format you're trying to keep.
Then find and replace (Ctrl+F) that word with a blank (space) character.
See how to use the SUBSTITUTE function here:
https://exceljet.net/excel-functions/excel-substitute-function

Since you are only interested in the end of the string, I don't think you need regex or anything too sophisticated.
If I understand correctly, you want the original string (product name S) up to but not including something that appears at the end (" S"). In your example, that means the 12 leftmost characters: the length of the original string (14) minus the length of the pattern (2), which gives you product name. If the original string does not end with the pattern, you want the original string unchanged.
Therefore, I suggest the following:
=IF(RIGHT("original string",LEN("pattern"))="pattern",
LEFT("original string",LEN("original string")-LEN("pattern")),
"original string")
Check these examples:
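The IF/RIGHT/LEFT logic above can be sketched in Python as follows. strip_suffix is a hypothetical name, and note one difference: Python's endswith is case-sensitive, whereas comparing text with = in an Excel formula is not.

```python
def strip_suffix(original: str, pattern: str) -> str:
    # Same shape as IF(RIGHT(original, LEN(pattern)) = pattern, LEFT(...), original).
    if pattern and original.endswith(pattern):
        return original[: len(original) - len(pattern)]
    return original

print(repr(strip_suffix("product name S", " S")))   # suffix present
print(repr(strip_suffix("product S name", " S")))   # " S" only in the middle
```

Only a trailing match is removed; the same text elsewhere in the string is left alone.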

How can I split a phrase into a new line every x characters on Google Sheets?

I am translating a game, and the game's text box only supports 50 characters max per line. Is there a way to use a formula to split the entire sentence every 50 characters or whole word (49, 48, 47, etc)?
I am currently working with this formula.
=JOIN(CHAR(10),SPLIT(REGEXREPLACE(A1, "(.{50})", "/$1"),"/"))
The problem with this code, is that it splits at exactly 50 characters (one time), and will split in the middle of the word.
So again, my goal is to have it not split on the 50th character IF the 50th character is in the middle of the word, and for the rule to apply for the rest of the lines too because it only applies on the first line.
Please take a look at this test google sheet to get an example of what I am talking about.
If it's impossible to do it on Google Sheets, I don't mind moving to Excel provided I get a functioning code.
For the record, I did ask in Google's product forums 2 days ago, and still haven't received an answer.

=REGEXREPLACE(A1, "(.{1,50})\b", "$1" & CHAR(10))
{50} matches exactly 50 times, but what you need is 50 or fewer.
\b is a word boundary that matches between an alphanumeric and a non-alphanumeric character.

= REGEXEXTRACT(A1,"(?ism)^"&REPT("([\w\d'\(\),. ]{0,49}\s)", ROUNDUP(LEN(A1)/50,0))&"([\w\d'\(\),. ]{0,49})$")
Tested with various expressions and works as intended. Note that only these characters [a-zA-Z0-9_'(),.] are allowed, which means - and other characters not listed will not work. If you need them, add them to both character classes in the formula (the one inside REPT and the final one). Otherwise, this will work perfectly.

You are pretty close. I'm not an expert in Sheets, so not sure if this is the best way, but your Regex is wrong for what you want.
Also, you need to be certain that you don't use a split character that might appear in the phrase itself. However, using CHAR(10) as the replacement character allows you to insert the LF without going through the JOIN/SPLIT sequence.
Replace any line feeds, carriage returns and spaces with a single space.
Match strings that start with a non-space character followed by up to 49 more characters, which are followed by a space or the end of the string.
Replace the match with the capture group followed by CHAR(10) (deleting the space that followed).
There will be an extra CHAR(10) at the end, which you can strip off.
EDIT: Regex changed slightly due to a difference in behavior between Google's RE and what I am used to (probably to do with how a non-backtracking regex works). The problem showed up on your example:
=regexreplace(REGEXREPLACE(REGEXREPLACE(A1 & " ","[\r\n\s]+"," "),"(\S.{0,49})\s","$1" & char(10)),"\n+\z","")
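The three nested REGEXREPLACE calls can be read as three steps. Here is a Python sketch of the same steps; it assumes Python's backtracking re engine gives the same result as Google's engine for these patterns, and wrap_lines and its width parameter are hypothetical.

```python
import re

def wrap_lines(text: str, width: int = 50) -> str:
    # Step 1: collapse every run of whitespace to one space (a trailing space is
    # appended so that the last word is also followed by whitespace).
    flat = re.sub(r"\s+", " ", text + " ")
    # Step 2: a non-space character plus up to width-1 more characters, followed
    # by whitespace; that whitespace is replaced by a line feed.
    broken = re.sub(r"(\S.{0,%d})\s" % (width - 1), "\\1\n", flat)
    # Step 3: strip the line feed left at the very end.
    return re.sub(r"\n+\Z", "", broken)

sentence = ("The quick brown fox jumps over the lazy dog and keeps running "
            "through the quiet forest until nightfall")
print(wrap_lines(sentence))
```

Because the quantifier is greedy, each line takes as many whole words as fit in the width. A single word longer than the width is not covered by this sketch.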

How to replace wildcharacter in CSV

I have the string below in my CSV files:
Part Number WP1166496 (AP6005317) replaces 1166496, 1156976.
Expected output:
Part Number WP1166496 replaces 1166496, 1156976.
I want to replace (AP6005317) with a blank.
There are many rows, each with a different value in the brackets.
So how can I replace any such bracketed string with a blank value?

I don't know how to achieve this exactly in Microsoft Excel.
If you look at the find and replace feature, you will most probably see an option to use regular expressions.
Enable the regular expression option and replace \(.*\) with a single space. This will solve your problem.
Note: This is tested and verified in LibreOffice Calc.
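Since the data lives in CSV files, the replacement can also be done in a script. The Python sketch below uses a variation on the pattern above, not the answer's exact expression: [^)]* stops at the first closing bracket, whereas the greedy .* would run from the first ( to the last ) on a line. drop_parenthesised is a hypothetical name.

```python
import re

def drop_parenthesised(line: str) -> str:
    # \s* also removes the space before the bracket; [^)]* stops at the first ")"
    # so two bracketed groups on one line are removed separately.
    return re.sub(r"\s*\([^)]*\)", "", line)

row = "Part Number WP1166496 (AP6005317) replaces 1166496, 1156976."
print(drop_parenthesised(row))
```

Removing the leading whitespace together with the bracket avoids leaving a double space where the bracketed text used to be.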
