Nesting Excel formulas to extract e-mail address top-level domain - excel

I want to extract the top-level domain from e-mail addresses using Excel formulas.
I first tried nesting RIGHT(..) formulas and splitting on the dot. Sadly, I do not know how to do this recursively with Excel formulas, so I switched to deleting all characters except the last 4. Now the problem is this: when I split my formulas into single cells, they work perfectly fine. If I try to use them together, I get only the output of the inner formula. How do I fix this?
=RIGHT(B8; LEN(B8)-(LEN(B8)-4))
=RIGHT(BF8;LEN(BF8)-FIND(".";BF8))
These are the formulas split into single cells. And here both together
=RIGHT(RIGHT(B8; LEN(B8)-(LEN(B8)-4));LEN(B8)-FIND(".";B8))
I get the same return value as in the first row from this formula
=RIGHT(B8; LEN(B8)-(LEN(B8)-4))

This =RIGHT(B8; LEN(B8)-(LEN(B8)-4)) is just a uselessly complicated version of =RIGHT(B8; 4).
Substituting this for BF8 in
=RIGHT(BF8;LEN(BF8)-FIND(".";BF8))
yields this
=RIGHT(RIGHT(B8; 4);LEN(RIGHT(B8; 4))-FIND(".";RIGHT(B8; 4)))
which can be simplified as
=RIGHT(RIGHT(B8; 4);4-FIND(".";RIGHT(B8; 4)))
So that's the answer to your question.
But note that this will fail when parsing e-mail addresses whose top-level domain name has more than 3 characters! So it won't work for e.g. test@test.info. Note that top-level domains can be up to 63 characters long!
In this earlier answer, I give a more general solution to this problem, not limited to searching a predetermined number of characters from the right.
=MID(B8;FIND(CHAR(1);SUBSTITUTE(B8;".";CHAR(1);LEN(B8)-LEN(SUBSTITUTE(B8;".";""))))+1;LEN(B8))
returns everything after the last . in the string.
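To check the logic of that formula outside Excel, here is a minimal Python sketch that follows the same steps (count the dots, mark the last one, take everything after the marker). The function name `text_after_last` is made up for this illustration, and the sketch assumes a single-character delimiter.

```python
def text_after_last(text: str, delimiter: str = ".") -> str:
    """Return everything after the last delimiter, step by step like the formula."""
    # LEN(B8)-LEN(SUBSTITUTE(B8;".";"")): how many delimiters the text contains
    count = text.count(delimiter)
    if count == 0:
        # Excel would return an error here, because instance number 0 is invalid
        raise ValueError(f"{delimiter!r} not found in {text!r}")
    marker = "\x01"  # CHAR(1), assumed never to occur in the text itself
    # SUBSTITUTE(...; count): replace only the last delimiter with the marker
    head, _, tail = text.rpartition(delimiter)
    marked = head + marker + tail
    # FIND(CHAR(1); ...)+1, then MID to the end of the original string
    return text[marked.index(marker) + 1:]


print(text_after_last("test@test.info"))          # info
print(text_after_last("john.johnson@email.com"))  # com
```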

A dot may also appear in the local part of an e-mail address, as in john.johnson@email.com.
So you can't just search for "."; you first need to find the @, then find the dot in the substring to its right.
These are the steps:
1. =FIND("@"; B8)
find the position of the @ character
2. =RIGHT(B8;LEN(B8) - FIND("@"; B8))
get the substring to the right of the @
3. =FIND(".";RIGHT(B8;LEN(B8) - FIND("@"; B8)))
find the first "." in the step 2 substring
4. =RIGHT(RIGHT(B8;LEN(B8) - FIND("@"; B8)); LEN(RIGHT(B8;LEN(B8) - FIND("@"; B8))) - FIND(".";RIGHT(B8;LEN(B8) - FIND("@"; B8))))
get RIGHT(step2; LEN(step2) - step3)
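The four steps can be traced in a short Python sketch. The function name and the `separator` parameter are assumptions for illustration; the separator defaults to "@", as in real e-mail addresses.

```python
def after_first_domain_dot(address: str, separator: str = "@") -> str:
    """Follow the four steps: find the separator, cut, find the dot, cut again."""
    at = address.index(separator)   # step 1: position of the separator
    domain = address[at + 1:]       # step 2: substring to its right
    dot = domain.index(".")         # step 3: first dot in that substring
    return domain[dot + 1:]         # step 4: everything after that dot


print(after_first_domain_dot("john.johnson@email.com"))  # com
print(after_first_domain_dot("a@mail.example.co.uk"))    # example.co.uk
```

As the second call shows, this takes everything after the first dot of the domain part, so an address with subdomains yields more than the top-level domain.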

Related

How can I substitute multiple occurrences of junk strings in Excel?

In the image, 'muddle' is the string containing junk words and the strings I want to extract. There is a fixed list of junk words - the good strings could be literally anything.
You can see this formula has correctly extracted "moo" and "coo", which are not in the list of junk words. The formula is below.
=LET(junkStart,FILTER(SEARCH(Table1[junkwords],Table2[muddle]),ISNUMBER(SEARCH(Table1[junkwords],Table2[muddle]))),
junkEnd,FILTER(SEARCH(Table1[junkwords],Table2[muddle])+LEN(Table1[junkwords])-1,ISNUMBER(SEARCH(Table1[junkwords],Table2[muddle])+LEN(Table1[junkwords])-1)),
goodstart,FILTER(junkEnd+1,(junkEnd+1<=LEN(Table2[muddle]))*(ISERROR(XMATCH(junkEnd+1,junkStart)))),
goodend,FILTER(junkStart-1,(junkStart-1>=LEN(1))*(ISERROR(XMATCH(junkStart-1,junkEnd))))+1,
goodchars,goodend-goodstart,
TEXTJOIN("; ",TRUE,MID(Table2[muddle],goodstart,goodchars)))
This works well, but it falls down if a junk word occurs more than once. See below.
The only difference is that 'woo' occurs twice in the second example.
I need a single cell solution. VBA is not an option for me. Using the name manager would be untidy, as would nested formulas.
I've got this far with formulas, which as far as I can tell is the furthest anyone has got with the 'removing multiple words from a cell' problem. I can see the issue - once SEARCH locates the start of a string in a cell, it doesn't go looking for a second occurrence of that string. But I don't know how to find the start of every instance of every string. Can anyone help?
REDUCE is perfect for this:
=REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(m,j,SUBSTITUTE(m,j,"")))
REDUCE starts with the Table2[muddle] value as m and substitutes the first value of Table1[junkwords], j, with "". The outcome becomes the new m, in which the second value of j is then substituted. That result becomes the new m again, and so on until every junk word has been processed.
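The same fold can be written in Python with `functools.reduce`, which may make the accumulator idea easier to see. This is only a sketch of the idea; the sample strings are invented, and `str.replace` (like SUBSTITUTE) is case-sensitive and removes every occurrence.

```python
from functools import reduce


def remove_junk(muddle: str, junk_words: list[str]) -> str:
    """Fold over the junk words; each step removes one word from the running text."""
    # m is the accumulator (starts as muddle), j is the current junk word
    return reduce(lambda m, j: m.replace(j, ""), junk_words, muddle)


# "woo" occurs twice and both occurrences are removed
print(remove_junk("boomoowoocoowoo", ["boo", "woo"]))  # moocoo
```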
If you want the result comma-separated, it becomes more complicated, but you can achieve it with:
=LET(t,SUBSTITUTE(","&REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(x,y,SUBSTITUTE(x,y,",")))&",",",,",","),
MID(t,2,LEN(t)-2))
This does almost the same as the previous solution, but instead of substituting blanks it substitutes a comma, and then replaces each double comma with a single one, so that junk words that follow each other result in one comma. Also, if the first and/or last part was substituted, the result would have a leading and/or trailing comma. This is solved by first adding a comma at the front and back before replacing the double commas with singles. The result t is then wrapped in MID, which removes the first and last character (both being a comma).
Alternate solution:
=LET(t,REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(x,y,SUBSTITUTE(x,y," "))),
SUBSTITUTE(TRIM(t)," ",","))
Or in one go if you don't want to use LET:
=SUBSTITUTE(TRIM(REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(x,y,SUBSTITUTE(x,y," "))))," ",",")
This replaces the junk words with a space. Regardless of how many junk words sit between the good words, or how many leading or trailing spaces result, TRIM reduces the text to the words separated by exactly one space. Substituting a comma for each space then gives your result.
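A rough Python equivalent of the space-and-TRIM variant is sketched below. The function name and sample data are invented, and it shares the formula's assumption that the good strings contain no spaces themselves.

```python
from functools import reduce


def extract_good(muddle: str, junk_words: list[str], sep: str = ",") -> str:
    """Replace each junk word with a space, collapse the spaces, join with sep."""
    spaced = reduce(lambda m, j: m.replace(j, " "), junk_words, muddle)
    # Like TRIM: drop leading/trailing spaces and collapse runs of spaces
    words = [part for part in spaced.split(" ") if part]
    return sep.join(words)


print(extract_good("boomoowoocoowoo", ["boo", "woo"]))  # moo,coo
```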
There's no single-formula solution if the junkwords list is not fixed.
Instead, you may choose to use the SUBSTITUTE() function in a chain of cells in the "Extracted Strings" column to remove all occurrences of each junk word in muddle: substitute "boo" in muddle, then substitute "voo" in the resulting string, then "noo" in that result, and so on. The last cell in the chain holds the final result.
One point to note, though: you need to make sure the junk words have no overlapping or partial-string problems, or you need to define the order of processing, for the solution to be "complete". Consider the following:
junk words = abc, def, cde
muddle = 1234abcdef5678
if you process the junk words in the above order, you get "12345678"
if you process the junk words in reverse order, you get "1234abf5678"
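The order dependence is easy to reproduce. The following Python sketch (the helper name is made up) applies the substitutions one after another, as a chain of SUBSTITUTE calls would:

```python
def strip_in_order(muddle: str, junk_words: list[str]) -> str:
    """Remove each junk word in turn, in exactly the order given."""
    for word in junk_words:
        muddle = muddle.replace(word, "")
    return muddle


junk = ["abc", "def", "cde"]
print(strip_in_order("1234abcdef5678", junk))        # 12345678
print(strip_in_order("1234abcdef5678", junk[::-1]))  # 1234abf5678
```

In reverse order "cde" is removed first, which destroys both "abc" and "def" before they can be matched.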

How can I extract an instagram username from a hyperlink to a neighboring column and keep the hyperlink on google sheets?

I'm tasked with going through a long column of instagram profile URLs and creating a new adjacent column comprised of just their usernames. I could theoretically go through the list individually, copy-pasting the part between ".com/" and the last "/" and then hyperlinking each of them, but I feel like there might be a faster way.
I've experimented with formulas trying to extract only the username but to no avail. I also realized formula cells can't be hyperlinked, so I would also need a solution for that. Here what I was trying so far:
Here the input and expected output:
URL                                            User Name
http://instgram.com/stack_overflow/            stack_overflow
http://instgram.com/stackoverflowing/          stackoverflowing
http://instgram.com/stackoverflowthestack/     stackoverflowthestack
http://instgram.com/stackoverflowingstacks/    stackoverflowingstacks
The end result should look like User Name Column and be adjustable for any length of usernames (slashes and spaces among other special characters cannot be used in instagram usernames).
Also, I'm unsure why my Google Sheets takes semicolons instead of the commas I'm used to in Excel, but it is what it is.
Figuring this out would save me loads of time in the long-run and I would be very appreciative.
If you can use TEXTAFTER (Office Insider Beta only, Windows: 2203 (Build 15104), Mac: 16.60 (220304)), you don't have to deal with LEFT/RIGHT/FIND functions. In cell B2, enter the following formula and fill it down:
=SUBSTITUTE(TEXTAFTER(A1,"/",-2),"/","")
Here is the output:
Instead of SUBSTITUTE, you can use TEXTBEFORE to remove the last /, as follows:
=TEXTBEFORE(TEXTAFTER(A1,"/",-2),"/")
Explanation
The main idea is that:
TEXTAFTER(text, delimiter, [instance_num], [match_mode], [match_end], [if_not_found])
allows searching backward by passing a negative number as the third input argument, instance_num. We are interested in the penultimate occurrence of /, so this input argument is -2.
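For readers who want to check the backward search outside Excel, here is a small Python sketch. `text_after` is a hypothetical helper that imitates only the negative instance_num case; it is not a full reimplementation of TEXTAFTER.

```python
def text_after(text: str, delimiter: str, instance_num: int) -> str:
    """Return the text after the n-th delimiter counted from the end (n < 0)."""
    if instance_num >= 0:
        raise ValueError("this sketch only handles negative instance numbers")
    n = -instance_num
    # Split at most n times, starting from the right
    parts = text.rsplit(delimiter, n)
    if len(parts) <= n:
        # Fewer than n delimiters; Excel would return an error here
        raise ValueError("delimiter does not occur often enough")
    return delimiter.join(parts[1:])


url = "http://instgram.com/stack_overflow/"
print(text_after(url, "/", -2))                   # stack_overflow/
print(text_after(url, "/", -2).replace("/", ""))  # stack_overflow
```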
Alternative Solution
If such functions are not available for your excel version the following works using MID(text, start_num, num_chars) function:
=LET(url, A1, length, LEN(url), count, length-LEN(SUBSTITUTE(url,"/",""))-1,
startPos, FIND(" ",SUBSTITUTE(url,"/"," ",count))+1, numChars, length-startPos,
MID(url,startPos, numChars))
Note: We use LET to avoid repeating the same calculation and also to make it easier to understand. Without LET it's a bigger formula but it works too:
=MID(A1, FIND(" ",SUBSTITUTE(A1,"/"," ",
LEN(A1)-LEN(SUBSTITUTE(A1,"/",""))-1))+1, LEN(A1) -
(FIND(" ",SUBSTITUTE(A1,"/"," ",LEN(A1)-LEN(SUBSTITUTE(A1,"/",""))-1))+1))
We are looking for the penultimate /, so the name count holds its occurrence number:
count, length-LEN(SUBSTITUTE(url,"/",""))-1
The following formula:
startPos, FIND(" ",SUBSTITUTE(url,"/"," ",count))+1
is a way to replace just the penultimate / by space ( ). URLs don't have spaces so it is a good replacement. Then finding the position of such space plus 1 will give us the starting position required by MID. Having the starting position and the length of the URL we can calculate the number of characters (numChars), so we can finally invoke MID with the required input arguments.
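The count, startPos and numChars steps translate to Python roughly as follows. The function name is invented, and Python's 0-based indexing shifts the arithmetic by one compared with the formula.

```python
def username_from_url(url: str) -> str:
    """Extract the text between the penultimate and the last slash."""
    length = len(url)
    count = url.count("/") - 1  # occurrence number of the penultimate slash
    if count < 1:
        raise ValueError("expected at least two slashes")
    # Walk forward to the count-th slash (the formula marks it with a space)
    pos = -1
    for _ in range(count):
        pos = url.index("/", pos + 1)
    start = pos + 1                 # first character after that slash
    num_chars = length - start - 1  # everything except the trailing slash
    return url[start:start + num_chars]


print(username_from_url("http://instgram.com/stack_overflow/"))  # stack_overflow
```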
Note: The approach you tried in your question relies on a specific pattern in the URL (ending with .com). The approaches above have no such constraint, so you can use a URL such as https://www.redcross.org/ and they still work.

Excel formula that produces one of two options

This is my first StackOverflow question, so apologies if I am unclear.
Currently, my work uses an Excel tracking doc to log project info. The column info is like so:
CELL B1 (Project Number) =IF(B2=""," ",MID(B2,FIND("P2",B2),9))
CELL B2 (Project Name) Client / P2XXXXXXX / Name
Thus, the P2XXXXXXX gets pulled out of B2 and populated into B1.
However, management has recently switched systems, so now, some project numbers have the P2XXXXXXX format and others have a PRJ-XXXXX format.
So we need a formula the produces nothing if the cell is blank and EITHER the P2XXXXXXX number or PRJ-XXXXX number if the cell is not blank.
Is it possible? If any further details are needed, let me know. Thanks in advance!
Well, if the / is always there then this can work:
=IF(B2="","",MID(B2,FIND("/",B2,1)+2,9))
assuming the project number always starts two characters after the first / and is always 9 characters long.
String Between Two Same Characters
Maybe the next month your company will start using a different first letter or could add more numbers e.g. SPRXXXXXXXXXX. So you could solve this problem by extracting whatever is between those two slashes.
=IF(B2="","",TRIM(MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1)))
Find the first character =FIND("/",B2), but we need the next one:
=FIND("/",B2)+1
Find the second character, but search from the position after the first one found:
=FIND("/",B2,FIND("/",B2)+1)
Now get the string between them:
=MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1)
(note that the length is the second position minus the start position, so the +1 of the start position turns into a -1: second - (first + 1) = second - first - 1).
Remove the leading and trailing spaces:
=TRIM(MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1))
Add the condition when the cell is blank:
=IF(B2="","",TRIM(MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1)))
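The same walk-through can be mirrored in Python to check the arithmetic. The function name and sample cell values are made up for illustration; note that `strip` only removes spaces at the ends, whereas Excel's TRIM also collapses repeated spaces inside the text.

```python
def between_first_two(text: str, delimiter: str = "/") -> str:
    """Return the trimmed text between the first and second delimiter."""
    if text == "":
        return ""                              # the blank-cell condition
    first = text.index(delimiter)              # first delimiter
    second = text.index(delimiter, first + 1)  # search again after the first
    return text[first + 1:second].strip(" ")   # cut out the middle and trim it


print(between_first_two("Client / P2XXXXXXX / Name"))  # P2XXXXXXX
print(between_first_two("Client / PRJ-12345 / Name"))  # PRJ-12345
```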
Here's another way using LEFT and RIGHT:
=IF(B2="","",TRIM(LEFT(RIGHT(B2,LEN(B2)-FIND("/",B2)),FIND("/",RIGHT(B2,LEN(B2)-FIND("/",B2)))-1)))
Although you can solve this problem with a combination of slicing, trimming, and complex conditionals, the most expressive and easy to maintain solution is to use regular expressions. Regular expressions have a bit of a learning curve, but there's a great playground website where you can experiment with them, and this page has a pretty good writeup on how regular expressions work in excel.
Specifically, this regular expression addresses the two naming conventions you've highlighted, but it can be updated to support more naming conventions as your company inevitably adds more:
P(RJ-)?((\d){9}|(\d){5})
To break that down from left to right:
P: both patterns start with a "P"
(RJ-)? One pattern follows with "RJ-", but the other doesn't. This is a grouped part of the pattern, and the question mark means that this part of the pattern is optional.
((\d){9}|(\d){5}): by far the nastiest part, but this basically means that there is going to be a sequence of numbers (\d), and there will either be nine of them or five of them. By wrapping the whole thing in parenthesis, they are always the second captured group, no matter the length of the sequence of numbers. This means that you can always extract the project id by looking at the value of the second capture group.
You can also make the expression more generalized by replacing ((\d){9}|(\d){5}) with simply (\d+). That just means "one or more digits." That gives you a much more simplified overall expression of this:
P(RJ-)?(\d+)
Depending on whether or not you care about validating strictly that project ids are 5 OR 9 digits long, that pattern above might be suitable, and it has the benefit of being more flexible. Still, the project ID is in the second captured group.
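As a quick way to try the generalized pattern outside Excel, here is a sketch using Python's `re` module. The digit values in the sample cells are invented, and the helper name is hypothetical.

```python
import re

# Generalized pattern from above: "P", optional "RJ-", then one or more digits
PROJECT_ID = re.compile(r"P(RJ-)?(\d+)")


def find_project_id(cell: str) -> str:
    """Return the first project number found in the cell, or "" if there is none."""
    match = PROJECT_ID.search(cell)
    return match.group(0) if match else ""


print(find_project_id("Client / P21234567 / Name"))  # P21234567
print(find_project_id("Client / PRJ-12345 / Name"))  # PRJ-12345
# The digits alone are in the second capture group
print(PROJECT_ID.search("Client / PRJ-12345 / Name").group(2))  # 12345
```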

Excel formula to search partial match and align row

I have 2 column data in Excel like this:
Can somebody help me write a formula in column C that will take the first name or the last name from column A and match it with column B and paste the values in column C. The following picture will give you the exact idea what I am trying to do. Thanks
Since your data is not "regular", you can try this formula which uses wild card to look for just the last name.
=INDEX($B$1:$B$4,MATCH("*" &MID(A1,FIND(" ",A1)+1,99)&"*",$B$1:$B$4,0))
It would be simpler if the first part followed some rule, but some have the first initial of the first name at the beginning; and others at the end.
Edit: (explanation added)
The FIND returns the character number of the first space
Add 1 to that to get the character number of the next word
Use MID to extract that word
Use that word in MATCH (with wildcards before and after) to find it in the array of email addresses. This will return its position in the array (row number)
Use that row number as an argument to the INDEX function to return the actual email address.
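The steps above can be sketched in Python, with `fnmatchcase` standing in for Excel's wildcard MATCH. The names and addresses are invented sample data. Both sides are lowercased because Excel's MATCH ignores case.

```python
from fnmatch import fnmatchcase
from typing import Optional


def lookup_by_last_name(name: str, emails: list[str]) -> Optional[str]:
    """Return the first address containing the last name, or None if none does."""
    # FIND the first space, add 1, and take the rest (the MID step)
    last_name = name[name.index(" ") + 1:]
    pattern = f"*{last_name.lower()}*"  # wildcards before and after
    for email in emails:                # MATCH plus INDEX
        if fnmatchcase(email.lower(), pattern):
            return email
    return None                         # Excel would show #N/A here


emails = ["jdoe@example.com", "smithj@example.com"]
print(lookup_by_last_name("John Smith", emails))  # smithj@example.com
```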
If you want to first examine the email address, you will need to determine which of the letters comprise the last name. Since this is not regular according to your example, you will need to check both options.
You will not be able to look for the first name from the email, as it is not present.
If you can guarantee that the first part will follow one rule or the other, e.g. either
FirstInitialLastName or
LastNameFirstInitial
Then you can try this:
=IFERROR(INDEX($B$1:$B$4,MATCH(MID(A1,FIND(" ",A1)+1,99)& LEFT(A1,1) &"*",$B$1:$B$4,0)),
INDEX($B$1:$B$4,MATCH( LEFT(A1,1)&MID(A1,FIND(" ",A1)+1,99) &"*",$B$1:$B$4,0)))
This seems to do what you want.
=IFERROR(VLOOKUP(LOWER(MID(A1,(SEARCH(" ",A1)+1),LEN(A1)))&LOWER(MID(A1,1,1))&"*",$B$1:$B$4,1,FALSE),VLOOKUP(LOWER(MID(A1,1,1))&LOWER(MID(A1,(SEARCH(" ",A1)+1),LEN(A1)))&"*",$B$1:$B$4,1,FALSE))
It's pretty crazy long and would likely be easier to digest and debug if broken up into columns instead of one huge formula.
It basically builds LastF (last name plus first initial) and FirstL (first initial plus last name) from the name field by splitting on the space.
LastF:
=LOWER(MID(A1,(SEARCH(" ",A1)+1),LEN(A1)))&LOWER(MID(A1,1,1))
And FirstL:
=LOWER(MID(A1,1,1))&LOWER(MID(A1,(SEARCH(" ",A1)+1),LEN(A1)))
We then make 2 vlookups for these by using wildcards:
LastF:
=VLOOKUP([lastfirst equation from above]&"*",$B$1:$B$4,1,FALSE)
And FirstL:
=VLOOKUP([firstlast equation from above]&"*",$B$1:$B$4,1,FALSE)
And then wrap those in IfError so they both get tried:
=IfError([firstLast vlookup],[lastfirst vlookup])
The rub is that's going to be hell to edit if you ever need to, which is why I suggest doing each piece in another column then referencing the previous one.
You also need to be aware that this answer will get tripped up by essentially the same name - e.g. Sam Smith and Sasha Smith would both match whatever the first entry for ssmith was. Any solution here will likely have the same pitfall.
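A compact Python sketch of the two-key lookup, including the ambiguity just described, might look like this. The helper names and addresses are hypothetical, and `startswith` plays the role of the trailing wildcard.

```python
from typing import Optional


def candidate_keys(name: str) -> list[str]:
    """Build the LastF and FirstL keys from a 'First Last' name."""
    first, last = name.split(" ", 1)
    initial = first[0].lower()
    last = last.lower()
    return [last + initial, initial + last]  # LastF, then FirstL


def lookup_email(name: str, emails: list[str]) -> Optional[str]:
    """Try each key in turn, like the two lookups wrapped in IFERROR."""
    for key in candidate_keys(name):
        for email in emails:
            if email.lower().startswith(key):  # key & "*"
                return email
    return None


emails = ["ssmith@example.com", "doej@example.com"]
print(lookup_email("Sam Smith", emails))    # ssmith@example.com
print(lookup_email("Sasha Smith", emails))  # ssmith@example.com (same match)
```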

retrieve part of the info in a cell in EXCEL

I vaguely remember that it is possible to parse the data in a cell and keep only part of the data after setting up certain conditions. But I can't remember what exact commands to use. Any help/suggestion?
For example, A1 contains the following info
0/1:47,45:92:99:1319,0,1320
Is there a way to pick up, say, 0/1 or 1319,0,1320 and remove the rest unchosen data?
I know I can do text-to-column and set the delimiter, followed by manually removing the "un-needed" data, but my EXCEL spreadsheet contains 100 columns X 500000 rows with each cell looking similar to the data above, so I am afraid EXCEL may crash before finishing the work. (have been trying with LEFT, LEN, RIGHT, MID, but none seems to work the way I had hoped)
Any suggestion will be greatly appreciated.
I think what you are looking for is combination of find and mid, but you'll have to work out exactly how you want to split your string:
A1 = 0/1:47,45:92:99:1319,0,1320 //your string
B1 = FIND(":",A1) //position of the first ":" symbol
C1 = LEN(A1) - B1 //number of characters after the first ":"
=LEFT(A1,B1-1) //everything to the left of the symbol
=MID(A1,B1+1,C1) //everything to the right of the symbol (you can replace C1 with a smaller number to extract only one portion)
//You can also nest the statements to save space, but usually at the cost of more processing
This is the concept, you will probably need to do it in multiple cells to split a string as long as yours. For multiple splits you probably want to replicate this command to target the result of previous right/mid command.
That way, you will get cell result sequence like:
0/1:47,45:92:99:1319,0,1320; 47,45:92:99:1319,0,1320; 92:99:1319,0,1320; 99:1319,0,1320......
From each of those you can retrieve left side of the string up to ":" to get each portion of a string.
If you are working with a large table you probably want to look into VB scripting. To my knowledge there is no single excel command that can take 1 cell and split it into multiple ones.
Let me try to help you with this. I am not a professional, so you may run into some problems. My solution adds 2 columns next to the source column, as you can see below. You can build on the formulas using the same principle.
Column B Formula:
=LEFT(A2,FIND(":",A2,1)-1)
Column C Formula:
=RIGHT(A2,LEN(A2)-FIND("|",SUBSTITUTE(A2,":","|",LEN(A2)-LEN(SUBSTITUTE(A2,":","")))))
Given your statement about having 100 columns, I imagine that in some instances you need to isolate characters in the middle of your string, so Left and Right may not always work. Use them where you can, though.
Assuming your string is in cell F2: 0/1:47,45:92:99:1319,0,1320
=LEFT(F2,3)
This returns 0/1 which are the first 3 characters in the string counting from the left. Likewise, Right functions similarly:
=RIGHT(F2,4)
This returns 1320, returning the 4 characters starting from the right.
You can use a combination of Mid and Find to dynamically find characters or strings based on defined characters. Here are a few examples of ways to dynamically isolate values in your string. Keep in mind that the key to these examples is the nested Find formula, where the innermost Find determines the first character to start at in the string.
1) Return 2 characters after the second : character
In cell F2 I need to isolate the "92":
=MID(F2,FIND(":",F2,FIND(":",F2)+1)+1,2)
The innermost Find locates the first : in the string (4 characters in). We add +1 to move to the 5th character (moving beyond the first : so the second Find will not see it), and the next Find starts looking for : again from that character. This second Find returns 10, as the second : is the 10th character in the string. The Mid formula takes over here. The final +1 moves past that : to the 11th character, so the formula is saying: starting at the 11th character, return 2 characters. Returning two characters is dictated by the 2 at the end of the formula (the last part of the Mid formula).
2) In this case I need to find the 2 characters after the 3rd : in the string. In this case "99":
=MID(F2,FIND(":",F2,FIND(":",F2,FIND(":",F2)+1)+1)+1,2)
You can see we have simply added one more nested Find to the formula in example 1.
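The nested Find pattern generalizes to "the n-th delimiter", which a loop expresses more compactly than ever deeper nesting. The following Python sketch uses an invented helper name and 0-based indexes, so positions are one lower than in Excel.

```python
def field_after_nth(text: str, delimiter: str, n: int, width: int) -> str:
    """Return width characters following the n-th occurrence of delimiter."""
    pos = -1
    for _ in range(n):                       # one pass per nested FIND
        pos = text.index(delimiter, pos + 1)
    return text[pos + 1:pos + 1 + width]     # the MID step


s = "0/1:47,45:92:99:1319,0,1320"
print(field_after_nth(s, ":", 2, 2))  # 92
print(field_after_nth(s, ":", 3, 2))  # 99
# Outside Excel, a plain split gives all fields at once
print(s.split(":")[0], s.split(":")[-1])  # 0/1 1319,0,1320
```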

Resources