How do you extract the first sentence in Excel - excel

I have around 2000 descriptions that need a short description. Here is an example of a description.
Chloe New, the heir of the Original Chloe is warm, feminine and a great signature scent. Chloe is a flowery scent with peonies, freesia, magnolia, lilies and rose petals along with ripe litchis. A hint of woods at the base makes Chloe New the best everyday fragrance. It's long lasting power keeps you fresh all day long.
The result being this
Chloe New, the heir of the Original Chloe is warm, feminine and a great signature scent.
Sometimes other descriptions will end like this, for example:
Chloe New, the heir of the Original Chloe is warm, feminine and a great signature scent. Chloe is
The current function I am using is the "=left(a1,70)" which takes the first 70 characters starting from the left. However, this function doesn't always extract the first sentence but ends at the beginning of the second sentence.
So my question is:
Is there a function that only extracts the first sentence of a cell?

Add a "space" right after "." to prevent the sentence from being prematurely trimmed in the case of an abbreviation like "e.g.":
=LEFT(A1,FIND(". ",A1))

Where A1 is the cell you are evaluating:
=LEFT(A1,SEARCH(".",A1))
SEARCH() finds the index of "." and you evaluate your LEFT() function up to that point.
EDIT: For this use case, my solution and Gary's solution are nearly identical. For other use cases, SEARCH might be preferable to FIND because it performs case-insensitive search and also supports wildcard characters.

if you want to tag off the comma use this:
=LEFT(A1, (SEARCH(",",A1,1))-1)

Related

Excel formula that produces one of two options

This is my first StackOverflow question, so apologies if I am unclear.
Currently, my work uses an Excel tracking doc to log project info. The column info is like so:
CELL B1 (Project Number) =IF(B2=""," ",MID(B2,FIND("P2",B2),9))
CELL B2 (Project Name) Client / P2XXXXXXX / Name
Thus, the P2XXXXXXX gets pulled out of B2 and populated into B1.
However, management has recently switched systems, so now, some project numbers have the P2XXXXXXX format and others have a PRJ-XXXXX format.
So we need a formula the produces nothing if the cell is blank and EITHER the P2XXXXXXX number or PRJ-XXXXX number if the cell is not blank.
Is it possible? If any further details are needed, let me know. Thanks in advance!
Well, if the / is always there then this can work:
IF(B2="","",MID(B2,FIND("/",B2,1)+2,9))
assuming the name is always 9 characters.
String Between Two Same Characters
Maybe the next month your company will start using a different first letter or could add more numbers e.g. SPRXXXXXXXXXX. So you could solve this problem by extracting whatever is between those two slashes.
=IF(B2="","",TRIM(MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1)))
Find the first character =FIND("/",B2), but we need the next one:
=FIND("/",B2)+1
Find the second character but search from the postition after the first found:
=FIND("/",B2,FIND("/",B2)+1)
Now get the string between them:
=MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1)
(note how the last minus was 'converted' from a plus to a minus (- + + = -)).
Remove the leading and trailing spaces:
=TRIM(MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1))
Add the condition when the cell is blank:
=IF(B2="","",TRIM(MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1)))
Here's another way using LEFT and RIGHT:
=IF(B2="","",TRIM(LEFT(RIGHT(B2,LEN(B2)-FIND("/",B2)),FIND("/",B2))))
Although you can solve this problem with a combination of slicing, trimming, and complex conditionals, the most expressive and easy to maintain solution is to use regular expressions. Regular expressions have a bit of a learning curve, but there's a great playground website where you can experiment with them, and this page has a pretty good writeup on how regular expressions work in excel.
Specifically, this regular expression addresses the two naming conventions you've highlighted, but it can be updated to support more naming conventions as your company inevitably adds more:
P(RJ-)?((\d){9}|(\d){5})
To break that down from left to right:
P: both patterns start with a "P"
(RJ-)? One pattern follows with "RJ-", but the other doesn't. This is a grouped part of the pattern, and the question mark means that this part of the pattern is optional.
((\d){9}|(\d){5}): by far the nastiest part, but this basically means that there is going to be a sequence of numbers (\d), and there will either be nine of them or five of them. By wrapping the whole thing in parenthesis, they are always the second captured group, no matter the length of the sequence of numbers. This means that you can always extract the project id by looking at the value of the second capture group.
You can also make the expression more generalized by replacing ((\d){9}|(\d){5}) with simply (\d+). That just means "one or more digits." That gives you a much more simplified overall expression of this:
P(RJ-)?(\d+)
Depending on whether or not you care about validating strictly that project ids are 5 OR 9 digits long, that pattern above might be suitable, and it has the benefit of being more flexible. Still, the project ID is in the second captured group.

Find value within cell from a series in Excel

I have a list of addresses, such as this:
Lake Havasu,  Lake Havasu City,  Arizona.
St. Johns River,  Palatka,  Florida.
Tennessee River,  Knoxville,  Tennessee.
I would like to extract the State from these addresses and then have a column showing the abbreviated State name (AZ, FL, TN etc.).
I have a table that has the States with their abbreviation and once I extract the State, doing a simple INDEX MATCH to get the abbreviation is easy. I don't want to use text-to-columns because this file will constantly have values added to it and it would be much easier to just have a formula that does the extraction for me.
The ways I've tried to approach this that have failed so far are:
Some kind of SEARCH() function that looks at the full State list and tries to find a value that exists in the cell
A MID or RIGHT approach to only capture the last section but I can't work out how to have FIND only look for the second ", "
A version of INDEX MATCH but that fails because I can't find a good way to search or find the values as per approach (1)
Any help would be appreciated!
Please try this formula, where A2 is the original text.
=FILTERXML("<data><a>" & SUBSTITUTE(A2,", ","</a><a>") & "</a></data>","data/a[3]")
An alternative would be to look for the 2nd comma as shown below. Note that the "50" in the formula is an irrelevant number required by the MID() function. It shouldn't be smaller than the number of characters you need to return, however.
Char(160) is a character that wouldn't (shouldn't) naturally occur in your text, as it might if the text comes from a UNIX database. You can replace it with another one that fits the description.
=TRIM(MID(A2, FIND(CHAR(160),SUBSTITUTE(A2,",",CHAR(160),2)) + 1,50))
The following variation of the above would remove the final period. It will fail if there is anything following the period, such as an unwanted blank. That could be accommodated within the formula as well but it would be easier to treat the original data, if that is an option for you.
=TRIM(MID(LEFT(A2, LEN(A2)-1), FIND(CHAR(160),SUBSTITUTE(A2,",",CHAR(160),2)) + 1,50))
To find the abbreviation I would recommend to use VLOOKUP rather than INDEX/MATCH.
Use this (screenshot refers):
=MID(MID(B3,1,LEN(B3)-1),SEARCH(",",B3,SEARCH(",",B3,1)+1)+3,LEN(B3))

Excel - Match in two columns with variable names

I've got two lists of people and I have to check if they're in both lists. The thing is that characters are not accepted in one of the lists ("-" for instance), and the person might have omitted a last name in case they have two.
For example:
A1 B2
John Paul John Paul Jones
Mary Williams Ryan Roberts
Ryan Roberts-Johnson Mary Williams
My formula is: =IFERROR(MATCH($A1,$B$1:$B$1215,0),IFERROR(MATCH(LEFT($A1,FIND(" ",$A1,1)),$B$1:$B$1215,0),"No Match"))
The idea is: if the name is the same, bring me the line where the person is. If not, look for the first name and see if you find someone with this first and bring it to me. If neither works, reply with "No Match".
But apparently the Match function only retrieves exact matches, so the First Name one doesn't work.
Is there any other way to solve this?
EDIT1: First finding: I can use the SUBSTITUTE formula to replace - with space and do the search once again.
Some of the things I did to save up some time (I probably spent more time figuring out/researching than I would have if I had done ~3500 entries manually, but the learning opportunity was great.)
What I wanted:
To search for for a cell with name + surname when I only had the surname.
What I did: I remembered about the wildcards and used them with VLOOKUP:
First I got the last name and added the star: ="*"&RIGHT($A1,LEN($A1)-FIND(" ",$A1,1))
=VLOOKUP(A1,$B$1:$B$1000,1,FALSE)
And it would try and find the first one. Before that I added a check to make sure that there weren't two people with the surname (so it wouldn't throw another person there) with a simple IF(COUNTIF($C$2:$C$2000,$D219)<=2, and then the rest of the formula.
Something else that I noticed and serves as a reminder: TRIM is very important, as some of the cells had two spaces in between first name/last and some one space after the last letter of the last name. Using TRIM to create a new column made me avoid a lot of mistakes I only took a while to notice.
I believe the case is solved.
You could create a temporary column extracting the first name from B column.
See this example.

Excel First Word (with error checking)

While I can extract the first word from a cell containing multiple text values with error checking to return the only word if no multiple values exist. I cannot seem to wrap my brain around adding more checks (or if it is even possible in the same nested formula) for situations where some of the source cells contain a comma between multiple words. Example, the formula below will return "James" from "James Marriott". But, it returns "James," from "James, Marriott". If all of my cells in the range were consistent that would be easy, but they aren't. Attempts to nest multiple find statements have resulted in failure. Suggestions?
=IFERROR(LEFT(A1,FIND(" ",A2)-1),A2)
To compound matters, there are also cells that contain abbreviations as the first word, so somehow I need to account for that as well. For example "J.W. Marriott" where I need to apply the above logic to extract "Marriott".
Here are some examples below:
Text Desired output
James Marriott James
James, Marriott James
Able Acme Able
Golden, Eagle Golden
J.W. Marriott Marriott
A.B. Acme Acme
you could use regex (to set up please look at the post here)
Then you can extract the desired word with a formula like:
=regex(A1, "(?![Etc])[a-zA-Z]{2,}")
(This is searching for a pattern of two or more lower or upper case letters in the cell A1...and not searching for Etc)

Looking up Bigrams in Excel

Suppose I have a list of two-word pairs in a column in Excel. These words are delimited by a space so that a typical pair might look like "extreme happiness". The goal is to search for these 'bigrams' in a larger string located in another column. The issue is that the bigram will only be found if the two words are together and separated by a space. What would be preferable is if Excel could look for both words anywhere in a given larger string. It is crucial that the bigrams occupy one cell each since a score is assigned to each bigram and in fact the function used VLOOKUPs this value based on the bigram cell value. Would it make sense to change the space between any two words to a - or some other character? Is there a way to have Excel look up each value one at a time (perhaps by recognizing this character and passing through the larger string twice, that is, once for each word)?
Example: "The weather last night was extremely cold, but the warm fire gave me some happiness."
Here we would like to find both the word 'extreme' within the word extremely and the word happiness. Currently Excel would not be successful in doing this since it would just look for "extreme happiness" and determine that no such string exists.
If the bigram in the row below "extreme happiness" reads "weather gave" (for some reason) Excel will go check whether that bigram exists in the larger string and return a second score. This is done so that at the end every score can be added together.
This is pretty easy with a couple of formulas. See screenshot below:
The logic is simple. Assuming your bigram is in B1, we can input the following in C1. This will replace the spaces with *, which is Excel's wildcard character.
=SUBSTITUTE(B2," ","*")
Then we concatenate it to give us a wildcarded beginning and end.
=CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*")
We then use a simple COUNTIF against the statement (here in A1) to return to us a count of occurence.
=COUNTIF(A2,CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*"))
A simple IF check enclosing the above, with condition >0, can be used to give us either Yes or No.
=IF(COUNTIF(A2,CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*"))>0,"Yes","No")
Let us know if this helps.

Resources