Extract email id of specific domain extensions - excel

I need to extract one email ID from each row, limited to specific domain extensions such as .com, .net and .org; everything else should be ignored. Below is sample data of two rows.
.#.3,.#.1601466914865855,.#.,.#.null,.#.,abc#xyz.com,abc#xyz.net,abc#xyz.org,null.val#.#.,.##,abc#xyz.jpb,abc#xyz.xls,abc#xyz.321
.#.3,.#.1601466914865855,.#.,.#.null,.#.,123#hjk.com,123#hjk.net,123#hjk.org,null.val#.#.,.##,abc#xyz.jpb,abc#xyz.xls,abc#xyz.321
The first email with a valid extension is enough: even if a row contains multiple IDs, I only need one per row. For the two sample rows above, the desired result would be abc#xyz.com and 123#hjk.com.
I believe this can be done with a custom formula using regex, but I can't wrap my head around it. I am using the latest desktop version of MS Excel.

If your email addresses are relatively simple, you can use this regex:
\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,}\b
In VBA:
Option Explicit
Function extrEmail(S As String) As String
    Dim RE As Object, MC As Object
    Const sPat As String = "\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,}\b"
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Pattern = sPat
        .ignorecase = True
        .Global = False
        .MultiLine = True
        If .test(S) = True Then
            Set MC = .Execute(S)
            extrEmail = MC(0)
        End If
    End With
End Function
Matching an email address can become very complicated, and a regex that follows all the rules is extraordinarily complex and long. But this one is relatively simple, and might work for your needs.
Explanation of Regex
Emailaddress1
\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,}\b
Options: Case insensitive; ^$ match at line breaks
Assert position at a word boundary \b
Match a single character present in the list below [A-Z0-9._%+-]+
Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
A character in the range between “A” and “Z” A-Z
A character in the range between “0” and “9” 0-9
A single character from the list “._%+” ._%+
The literal character “-” -
Match the character “#” literally #
Match a single character present in the list below [A-Z0-9.-]+
Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
A character in the range between “A” and “Z” A-Z
A character in the range between “0” and “9” 0-9
The literal character “.” .
The literal character “-” -
Match the character “.” literally \.
Match a single character in the range between “A” and “Z” [A-Z]{2,}
Between 2 and unlimited times, as many times as possible, giving back as needed (greedy) {2,}
Assert position at a word boundary \b
Created with RegexBuddy
EDIT: To match only specific domain extensions, replace the part of the regex that matches the extension ([A-Z]{2,}) with a group of pipe-separated extensions.
e.g.
\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.(?:com|net|org)\b
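To check the restricted pattern against the sample rows outside Excel, here is a minimal Python sketch. It is only an illustration: the function name first_email is hypothetical, and Python's re module stands in for the VBScript regex engine used in the VBA above, so treat it as a sanity check of the pattern rather than of the UDF itself.

```python
import re

# Same pattern as above, restricted to .com, .net and .org.
# "#" takes the place of the usual "@", as in the sample data.
PATTERN = re.compile(
    r"\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.(?:com|net|org)\b",
    re.IGNORECASE,
)


def first_email(row: str) -> str:
    """Return the first address with an allowed extension, or "" if none."""
    match = PATTERN.search(row)
    return match.group(0) if match else ""


rows = [
    ".#.3,.#.1601466914865855,.#.,.#.null,.#.,abc#xyz.com,abc#xyz.net,"
    "abc#xyz.org,null.val#.#.,.##,abc#xyz.jpb,abc#xyz.xls,abc#xyz.321",
    ".#.3,.#.1601466914865855,.#.,.#.null,.#.,123#hjk.com,123#hjk.net,"
    "123#hjk.org,null.val#.#.,.##,abc#xyz.jpb,abc#xyz.xls,abc#xyz.321",
]

for row in rows:
    print(first_email(row))
```

Because only the first match is requested, the later .net and .org addresses in each row are never reached, which mirrors .Global = False in the VBA.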

Related

Excel: Find and Replace without also grabbing the beginning of another word

I'm currently working on shortening a large Excel sheet using Find/Replace. I'm finding all instances of strings like ", Inc.", ", Co.", " LLC", etc. and replacing them with nothing (i.e. removing them). The problem is that I can't run similar searches for " Inc", ", Inc", ", Co", etc., because they would also remove the beginnings of words like ", Inc"orporated and ", Co"mpany.
Is there a blank character or something I can do in VBA that would allow me to just find/replace items with nothing after what I'm finding (I.e. finding ", Co" without also catching ", Co"rporated)?
In VBA you can use Regular Expressions to ensure that there are "word boundaries" before and after the abbreviation you are trying to remove. You can also remove extraneous spaces that might appear, depending on the original string.
Function remAbbrevs(S As String, ParamArray abbrevs()) As String
    Dim RE As Object
    Dim sPat As String
    sPat = "\s*\b(?:" & Join(abbrevs, "|") & ")\b\.?"
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .ignorecase = False
        .Pattern = sPat
        remAbbrevs = .Replace(S, "")
    End With
End Function
For arguments to this function you can enter a series of abbreviations. The function creates an appropriate regex to use.
For example, I entered:
=remAbbrevs(A1,"Inc","Corp")
and filled down.
Explanation of the regex:
remAbbrevs
\s*\b(?:Inc|Corp)\b\.?
Options: Case sensitive
Match a single character that is a “whitespace character” \s*
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
Assert position at a word boundary \b
Match the regular expression below (?:Inc|Corp)
Match this alternative Inc
Match the character string “Inc” literally Inc
Or match this alternative Corp
Match the character string “Corp” literally Corp
Assert position at a word boundary \b
Match the character “.” literally \.?
Between zero and one times, as many times as possible, giving back as needed (greedy) ?
Created with RegexBuddy
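To see how the generated pattern behaves on a few strings outside Excel, here is a rough Python sketch of the same idea. The name rem_abbrevs and the test strings are made up for illustration, and the call to re.escape is an addition that the VBA version does not have.

```python
import re


def rem_abbrevs(text: str, *abbrevs: str) -> str:
    """Remove whole-word abbreviations, plus any space before and dot after."""
    # Build one alternation from the arguments. re.escape is an extra
    # safeguard (not in the VBA above) that keeps characters such as "."
    # literal if they appear inside an abbreviation.
    alternatives = "|".join(re.escape(a) for a in abbrevs)
    pattern = r"\s*\b(?:" + alternatives + r")\b\.?"
    return re.sub(pattern, "", text)


print(rem_abbrevs("Acme Inc.", "Inc", "Corp"))
print(rem_abbrevs("Acme Incorporated", "Inc", "Corp"))
print(rem_abbrevs("Foo Corp Holdings", "Inc", "Corp"))
```

The second call leaves the string alone because the word boundary after "Inc" fails when more letters follow.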

Use VBA to remove leading and trailing blank spaces but keeping blanks within a string

I'm trying to clean and format a set of data obtained from an accounting system, and I have been able to create VBA code that applies the TRIM or CLEAN functions to specific column ranges.
The thing is that I need to keep the blank spaces within the strings (there can be 2, 3 or more) while still removing the leading and trailing spaces, and the functions mentioned above reduce inner runs of spaces to one. That does not work for me, because the data is used as a key to match other information in later steps of the process. Bear in mind that leading/trailing blanks can come from the space bar, from any other character that appears as a blank, or even from line breaks, so I want all of these removed but the inner blanks kept. Strings can be made of alphanumeric characters.
I'm using this in a Private Sub (the code is executed by clicking a button placed in the worksheet).
Dim rng1a As Range
Dim Area1a As Range
Set rng1a = Range("F2:F35001")
For Each Area1a In rng1a.Areas
    Area1a.NumberFormat = "#"
    Area1a.Value = Evaluate("IF(ROW(" & Area1a.Address & "),CLEAN(TRIM(" & Area1a.Address & ")))")
Next Area1a
Example (in range F2:F35001):
Original: Sample Text for Review. *(there are blanks after the string)
Result:Sample Text for Review.
Desired:Sample Text for Review.
I researched this for a couple of weeks and haven't found a solution that keeps the inner blanks as is; I have also tried to avoid posting a duplicate question. Thanks in advance for the help.
You can do this with regular expressions:
Option Explicit
Function trimWhiteSpace(s As String) As String
    Dim RE As Object
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .MultiLine = True
        .Pattern = "^\s*(\S.*\S)\s*"
        trimWhiteSpace = .Replace(s, "$1")
    End With
End Function
Explanation of the Regex
Trim leading and trailing white space
^\s*(\S.*\S)\s*
Options: Case sensitive; ^$ match at line breaks
Assert position at the beginning of a line (at beginning of the string or after a line break character) (line feed, line feed, line separator, paragraph separator) ^
Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) \s*
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
Match the regex below and capture its match into backreference number 1 (\S.*\S)
Match a single character that is NOT a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) \S
Match any single character that is NOT a line break character (line feed) .*
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
Match a single character that is NOT a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) \S
Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) \s*
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
$1
Insert the text that was last matched by capturing group number 1 $1
Created with RegexBuddy
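As a quick check of what this pattern does to inner blanks, here is a small Python sketch (Python's re standing in for the VBScript engine; the name trim_white_space is just for illustration). It also shows one limitation: since \S.*\S needs at least two non-whitespace characters, a string with a single visible character is left untouched. The VARIANT pattern is a hypothetical tweak for that case, and I have not tested it in VBA.

```python
import re

ORIGINAL = r"^\s*(\S.*\S)\s*"
# Hypothetical variant: the part after the first \S is optional, so a
# single visible character also matches.
VARIANT = r"^\s*(\S(?:.*\S)?)\s*"


def trim_white_space(s: str, pattern: str = ORIGINAL) -> str:
    """Strip leading/trailing whitespace but keep inner runs of blanks."""
    return re.sub(pattern, r"\1", s, flags=re.MULTILINE)


print(repr(trim_white_space("  Sample   Text for Review.   \n")))
print(repr(trim_white_space(" a ")))           # unchanged with ORIGINAL
print(repr(trim_white_space(" a ", VARIANT)))  # trimmed with VARIANT
```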
On the other hand, if you want to avoid regular expressions, and if your only leading/trailing "white-space" characters are space, tab and linefeed, AND if the only "internal" white space characters are the space, you could use:
Function trimWhiteSpace(s As String) As String
    trimWhiteSpace = Trim(Replace(Replace(s, vbLf, ""), vbTab, ""))
End Function
Note that the VBA Trim function (unlike the worksheet function) only removes leading and trailing spaces, and leaves internal spaces unchanged. But this won't work if you have tabs within the string that need to be preserved.
Either of the above can be incorporated into your macro.
Have you tried using the LTRIM function to remove leading spaces then RTRIM to remove the trailing ones which will leave the internal ones intact?
From your description you don't expect TAB characters or Carriage Returns in the middle of your strings so you could just do a replace for them:
strSource = Replace(strSource, vbTab, "")
strSource = Replace(strSource, vbCrLf, " ")

Removing Whole Numbers from an Alphanumeric String

I'm having trouble finding a way to remove floating integers from a cell without removing numbers attached to the end of my string. Could I get some help as to how to approach this issue?
For example, in the image attached, instead of:
john123 456 hamilton, I want:
john123 hamilton
This can be done using regular expressions. You will match on the data you want to remove, then replace this data with an empty string.
Since you didn't provide any code, all I can do is provide a function that you can adapt to your own project. It can be called from VBA or used as a worksheet function, such as =ReplaceFloatingIntegers(A1).
You will need to add a reference to Microsoft VBScript Regular Expressions 5.5 by going to Tools, References in the VBE menu.
Function ReplaceFloatingIntegers(ByVal inputString As String) As String
    With New RegExp
        .Global = True
        .MultiLine = True
        .Pattern = "(\b\d+\b\s?)"
        If .Test(inputString) Then
            ReplaceFloatingIntegers = .Replace(inputString, "")
        Else
            ReplaceFloatingIntegers = inputString
        End If
    End With
End Function
Breaking down the pattern
( ... ) This is a capturing group. It is not strictly needed here, because .Replace() replaces the entire match, not just the group.
\b This is a word boundary. We use it because we want to test from edge to edge of any 'words' (which includes words that contain only digits in our case).
\d+\b This matches any digit (\d), one to unlimited times (+), up to the next word boundary (\b).
\s? This matches a single whitespace character if one is present; the ? makes it optional.
You can look at this personalized Regex101 page to see how this matches your data. Anything matched here is replaced with an empty string.
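If you want to try the pattern without opening Excel, here is a minimal Python sketch of the same replacement (the function name is made up for illustration, and Python's re stands in for the VBScript engine):

```python
import re


def replace_floating_integers(text: str) -> str:
    """Remove digit-only 'words' and one whitespace character after each."""
    # \b\d+\b only matches digits that form a whole word, so the "123"
    # glued to "john" is left alone.
    return re.sub(r"\b\d+\b\s?", "", text)


print(replace_floating_integers("john123 456 hamilton"))
```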

How to get string in between two characters in excel/spreadsheet

I have this string
Weiss,Emery/Ap #519-8997 Quam. Street/Hawaiian Gardens,IN - 79589|10/13/2010
How do I get only Hawaiian Gardens?
I already tried
=mid(left(A1,find("/",A1)-1),find(",",A1)+1,len(A1))
but it gives me Emery instead.
If there are always two slashes before the string you want to extract, then, based on Tyler M's answer, you can use this:
=MID(E1,
FIND("~",SUBSTITUTE(E1,"/","~",2))+1,
FIND(",",RIGHT(E1,LEN(E1)-FIND("~",SUBSTITUTE(E1,"/","~",2))))-1
)
This substitutes the second occurrence of / with a character that would not normally occur in the address, which makes it findable.
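The steps the formula goes through can be sketched in Python as follows. This is only an illustration of the logic, not something to paste into Excel; the function name is hypothetical, and it assumes the text really contains two slashes followed by a comma.

```python
def between_second_slash_and_comma(text: str) -> str:
    """Return the text between the 2nd "/" and the next comma."""
    first = text.find("/")
    second = text.find("/", first + 1)  # what the SUBSTITUTE/FIND pair locates
    rest = text[second + 1:]            # what RIGHT(...) returns in the formula
    return rest[:rest.find(",")]        # up to, but not including, the comma


sample = "Weiss,Emery/Ap #519-8997 Quam. Street/Hawaiian Gardens,IN - 79589|10/13/2010"
print(between_second_slash_and_comma(sample))
```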
Did you also intend to include Google Sheets (judging by your title)? If so, you can use the REGEXEXTRACT() function. For example, in B1:
=REGEXEXTRACT(A1,"\/([\w\s]*)\,")
In Excel you could build a UDF using this regex rule like so (as an example):
Function REGEXEXTRACT(S As String, PTRN As String) As String
    'We will get the last possible match in your string...
    Dim regex As Object
    Dim matches As Object, Match As Object, subMatch As Variant
    Set regex = CreateObject("VBScript.RegExp")
    With regex
        .Pattern = PTRN
        .Global = True
    End With
    Set matches = regex.Execute(S)
    For Each Match In matches
        If Match.SubMatches.Count > 0 Then
            For Each subMatch In Match.SubMatches
                REGEXEXTRACT = subMatch
            Next subMatch
        End If
    Next Match
End Function
Call the function in B1 like so:
=REGEXEXTRACT(A1,"\/([\w\s]*)\,")
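To check the pattern itself against the sample string, here is a short Python sketch that mimics what the UDF does, i.e. it keeps the last captured group of the last match. The function name is made up, and Python's re is only a stand-in for the VBScript engine.

```python
import re


def regex_extract_last(text: str, pattern: str) -> str:
    """Return the last captured group of the last match, or "" if none."""
    result = ""
    for match in re.finditer(pattern, text):
        for group in match.groups():
            if group is not None:
                result = group
    return result


sample = "Weiss,Emery/Ap #519-8997 Quam. Street/Hawaiian Gardens,IN - 79589|10/13/2010"
print(regex_extract_last(sample, r"\/([\w\s]*)\,"))
```

The first slash does not produce a match because "#" is neither a word character nor whitespace, so the only match is the one that starts at the second slash.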

Search for a specific number (in inch) where the number isn't part of a larger expression

I want to use an Excel formula that returns the correct index in the cell when the text in the cell contains the term '2"' (two inch). This is possible with the search function.
The catch is that I only want to find instances where it's actually '2"', not cases where you have other expressions such as '1/2"' or '12"'. See the image below for an example to clarify where search works and where it doesn't.
I think a VBA solution using Regular Expressions will be easiest in order to be able to return measurements like 1 1/2".
To enter this User Defined Function (UDF), alt-F11 opens the Visual Basic Editor.
Ensure your project is highlighted in the Project Explorer window.
Then, from the top menu, select Insert/Module and
paste the code below into the window that opens.
To use this User Defined Function (UDF), enter a formula like
=FindMeasure(A1,$E$1)
in some cell, where E1 contains a value like 2" or 1 1/2"
Option Explicit
Function FindMeasure(sSearch As String, ByVal sMeasure As String)
    Dim RE As Object, MC As Object, SM As Variant
    Dim sPat As String
    sPat = "\D(\s+)" & sMeasure & "|^" & sMeasure
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = False
        .MultiLine = True
        .Pattern = sPat
    End With
    If RE.test(sSearch) = True Then
        Set MC = RE.Execute(sSearch)
        SM = MC(0).submatches(0)
        FindMeasure = MC(0).firstindex + Len(SM) + IIf(Len(SM) > 0, 2, 1)
    Else
        FindMeasure = 0
    End If
End Function
EDIT: Reviewing my answer reveals that under certain circumstances, incorrect results will be returned.
If there is a "word" preceding the measurement which ends with a digit, the routine will fail to recognize the measurement. This can be avoided by ensuring that there is at least one non-digit in the string preceding the measurement (by modifying the regex). However, if the entire word consists of digits, the measurement will not be recognized.
If the line starts with a SPACE, the measurement will not be recognized. This can be corrected by modifying both the code and the regex to account for that possibility.
If the cell containing the measurement, or the cell containing the string, is blank, then the result will be incorrect. This can be avoided by testing for those conditions, by modifying the code.
Modified Code
Option Explicit
Function FindMeasure(sSearch As String, ByVal sMeasure As String)
    Dim RE As Object, MC As Object, SM As Variant
    Dim sPat As String
    sPat = "(\S*\D\S*\s+)" & sMeasure & "|(^\s*)" & sMeasure
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = False
        .MultiLine = True
        .Pattern = sPat
    End With
    If RE.test(sSearch) = True And _
            Len(sSearch) > 0 And _
            Len(sMeasure) > 0 Then
        Set MC = RE.Execute(sSearch)
        SM = MC(0).submatches(0) & MC(0).submatches(1)
        FindMeasure = MC(0).firstindex + Len(SM) + 1
    Else
        FindMeasure = 0
    End If
End Function
Explanation of Regex with sMeasure = 2"
(\S*\D\S*\s+)2"|(^\s*)2"
Options: Case insensitive; ^$ match at line breaks
Match this alternative (\S*\D\S*\s+)2"
Match the regex below and capture its match into backreference number 1 (\S*\D\S*\s+)
Match a single character that is NOT a “whitespace character” \S*
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
Match a single character that is NOT a “digit” \D
Match a single character that is NOT a “whitespace character” \S*
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
Match a single character that is a “whitespace character” \s+
Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
Match the character string “2"” literally 2"
Or match this alternative (^\s*)2"
Match the regex below and capture its match into backreference number 2 (^\s*)
Assert position at the beginning of a line ^
Match a single character that is a “whitespace character” \s*
Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
Match the character string “2"” literally 2"
Created with RegexBuddy
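For anyone who wants to try the modified pattern on a few strings outside Excel, here is a rough Python sketch of the same logic. The name find_measure and the test strings are illustrative only; re.escape is an addition that the VBA version does not have, and Python's re is a stand-in for the VBScript engine.

```python
import re


def find_measure(search: str, measure: str) -> int:
    """Return the 1-based position of a stand-alone measure, or 0."""
    if not search or not measure:
        return 0
    m = re.escape(measure)  # extra safeguard, not in the VBA above
    pattern = r"(\S*\D\S*\s+)" + m + r"|(^\s*)" + m
    match = re.search(pattern, search, flags=re.MULTILINE)
    if match is None:
        return 0
    # Only one of the two groups takes part in a match; the other is None.
    prefix = (match.group(1) or "") + (match.group(2) or "")
    return match.start() + len(prefix) + 1


for text in ['pipe 2" long', '2" pipe', '1/2" pipe', '12" pipe']:
    print(text, "->", find_measure(text, '2"'))
```

The last two strings give 0 because the 2" in them is not preceded by whitespace and does not sit at the start of the line.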
use:
=SEARCH(" 2"""," " & A1)-1
There are three tiny tricks associated with this formula:
we search for {space} 2 "
we place a blank at the start of the string
we account for the blank by subtracting one from the position
EDIT#1:
This may be better. With data in A3 and the string in B2 try:
=IFERROR(IF(LEFT(A3,2)=$B$2,1,SEARCH(" " & $B$2, A3)+1),0)
If I'm reading this right, the result should be "OK" if any of the following holds, and zero otherwise:
1) the string starts with 2" (followed by a space)
2) there is a 2" in the middle of the string (with a space on each side)
3) the string ends with 2" (preceded by a space)
If those are the requirements this formula should work:
=IF(OR(LEFT(A2,3)="2"" ",ISNUMBER(SEARCH(" 2"" ",A2)),RIGHT(A2,3)=" 2"""),"OK",0)
-- or you may have to use this for the second row depends on your requirements --
=IF(OR(LEFT(A2,2)="2""",ISNUMBER(SEARCH(" 2"" ",A2)),RIGHT(A2,3)=" 2"""),"OK",0)

Resources