Extract Sub-String from String in Excel (No VBA) - excel

I have a series of paths in Excel which follow the pattern:
C:\Folder\Subfolder1\SURNAME, Firstname\Subfolder2\SURNAME, Firstname - YYYY MM DD - Invoice.pdf
I cannot use VBA, so using an array formula, how would I extract SURNAME, Firstname?

You may use:
=TRIM(MID(SUBSTITUTE(A1,"\",REPT(" ",LEN(A1))),3*LEN(A1)+1,LEN(A1)))
Here 3* stands for n-1, so change the multiplier to n-1 to get the nth substring of the delimited string.
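To see why the padding trick works, here is a minimal Python sketch of the same steps (the function name nth_field_padded is made up for illustration). It assumes, as the formula does, that every delimiter is blown up to LEN(A1) spaces so that each field lands in its own window of that width. Excel's TRIM also collapses runs of internal spaces, which strip() does not, so treat this as an approximation.

```python
def nth_field_padded(text: str, delimiter: str, n: int) -> str:
    """Mimic TRIM(MID(SUBSTITUTE(A1, delim, REPT(" ", LEN(A1))), (n-1)*LEN(A1)+1, LEN(A1)))."""
    width = len(text)                               # LEN(A1)
    padded = text.replace(delimiter, " " * width)   # SUBSTITUTE + REPT
    start = (n - 1) * width                         # MID is 1-based, Python slices are 0-based
    return padded[start:start + width].strip()      # MID + (simplified) TRIM


path = r"C:\Folder\Subfolder1\SURNAME, Firstname\Subfolder2\SURNAME, Firstname - YYYY MM DD - Invoice.pdf"
print(nth_field_padded(path, "\\", 4))
```

With n = 4 the window starts at 3*LEN(A1), which is exactly the 3* in the formula.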
Another option, with access to FILTERXML:
=FILTERXML("<t><s>"&SUBSTITUTE(A1,"\","</s><s>")&"</s></t>","//s[position()=4]")
This pulls the 4th substring from a "\"-delimited string. Change position()=4 to another position to retrieve a different substring. The formula is a bit longer, but it comes in handy when you want to retrieve several substrings, since only the XPath needs to change.
After your comment, I think you might want to try:
=FILTERXML("<t><s>"&SUBSTITUTE(SUBSTITUTE(A6," - ","\"),"\","</s><s>")&"</s></t>","//s[position()=last()-2]")
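As a rough illustration of what the two XPath expressions select, the following Python sketch splits the same sample path with str.split instead of FILTERXML. The variable names are made up; the point is only to show which element position()=4 and position()=last()-2 correspond to.

```python
path = r"C:\Folder\Subfolder1\SURNAME, Firstname\Subfolder2\SURNAME, Firstname - YYYY MM DD - Invoice.pdf"

# //s[position()=4]: XPath positions are 1-based, Python indexes are 0-based.
fields = path.split("\\")
fourth = fields[3]

# Second formula: " - " is first turned into "\", then everything is split on "\".
# //s[position()=last()-2] is the third element counted from the end.
fields_with_dashes = path.replace(" - ", "\\").split("\\")
third_from_last = fields_with_dashes[-3]

print(fourth)
print(third_from_last)
```

Both lines print the name, the first taken from the folder and the second taken from the file name.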

With data in A1, in B1 enter:
=LEFT(RIGHT(A1,LEN(A1)-FIND("#",SUBSTITUTE(A1,"\","#",LEN(A1)-LEN(SUBSTITUTE(A1,"\",""))),1)),FIND("-",RIGHT(A1,LEN(A1)-FIND("#",SUBSTITUTE(A1,"\","#",LEN(A1)-LEN(SUBSTITUTE(A1,"\",""))),1)))-2)
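The long formula above does two things: it takes everything after the last backslash, then keeps the text before the first hyphen (minus the space in front of it). A minimal Python sketch of that logic, with a hypothetical function name, could look like this. It assumes, like the formula, that the name itself contains no hyphen.

```python
def name_from_path(path: str) -> str:
    # RIGHT(A1, LEN(A1) - position of last "\"): the file name only.
    filename = path[path.rfind("\\") + 1:]
    # FIND("-", filename) - 2 characters from the left: drop " -" and what follows.
    hyphen = filename.find("-")
    return filename[:hyphen - 1]


path = r"C:\Folder\Subfolder1\SURNAME, Firstname\Subfolder2\SURNAME, Firstname - YYYY MM DD - Invoice.pdf"
print(name_from_path(path))
```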

Or try this in B1, copied across to the right until it returns blank, then copied down:
=TRIM(MID(SUBSTITUTE("\"&MID($A1,FIND("\",$A1,4)+1,FIND("-",$A1,4)-FIND("\",$A1,4)-2),"\",REPT(" ",199)),COLUMN(A1)*399,199))

Related

splitting underscores in Excel

I'm fairly new to Excel and need some assistance. I have a Column that has a list of files that look like:
12345_v1.0_TEST_Name [12345]_01.01.2022.html
45321_v55.9_Some Name Here [64398]_07.15.2018.html
56871_v14.2_Test[64398]_10.30.2019.html
Each file name can be different depending on what output is provided to me.
Note: There are other random files in the same format, however where it says Test_Name there could be an underscore and sometimes no underscore. Would like that to be ignored in the formula or vba. Files also can change but will be in the same format.
I need some help with a formula or vba that splits the underscores and outputs the data into their own cells:
Column C 12345
Column D v1.0
Column E TEST_Name [12345]
Column F 01.01.2022
Column G .html
Since the file extension can differ while the format stays the same, I have tweaked my earlier formula so that it works with any file extension.
FORMULA IN CELL C1
=IF(LEN($B1)-LEN(SUBSTITUTE($B1,"_",""))+1>4,
TRIM(MID(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE($B1,"."&TRIM(RIGHT(SUBSTITUTE(
SUBSTITUTE($B1,"."," ",LEN($B1)-LEN(SUBSTITUTE($B1,".","")))," ",REPT(" ",200)),100)),"_"&"."&
TRIM(RIGHT(SUBSTITUTE(SUBSTITUTE($B1,"."," ",LEN($B1)-LEN(SUBSTITUTE($B1,".","")))," ",
REPT(" ",200)),100))),"_"," ",3),"_",REPT(" ",100)),COLUMN(A1)*99-98,100)),
TRIM(MID(SUBSTITUTE(SUBSTITUTE($B1,"."&TRIM(RIGHT(SUBSTITUTE(SUBSTITUTE(
$B1,"."," ",LEN($B1)-LEN(SUBSTITUTE($B1,".","")))," ",REPT(" ",200)),100)),"_"&"."&
TRIM(RIGHT(SUBSTITUTE(SUBSTITUTE($B1,"."," ",LEN($B1)-LEN(SUBSTITUTE($B1,".","")))," ",
REPT(" ",200)),100))),"_",REPT(" ",100)),COLUMN(A1)*99-98,100)))
Fill down and fill across.
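The nested formula is hard to read, so here is a hedged Python sketch of what it appears to do (split_fields is a made-up name). It puts an underscore in front of the extension, and when the original name has four or more underscores it turns the third one into a space so the name part stays in one field. Note that this changes TEST_Name into TEST Name, which is also what the formula does.

```python
def split_fields(name: str) -> list[str]:
    # Locate the extension as "everything after the last dot".
    stem, dot, ext = name.rpartition(".")
    # Insert "_" before the extension so it becomes its own field.
    parts = f"{stem}_{dot}{ext}".split("_")
    # LEN(B1)-LEN(SUBSTITUTE(B1,"_",""))+1 > 4: the name part holds an underscore.
    if name.count("_") + 1 > 4:
        # Replace only the 3rd underscore with a space (SUBSTITUTE(..., "_", " ", 3)).
        parts[2:4] = [parts[2] + " " + parts[3]]
    return parts


print(split_fields("12345_v1.0_TEST_Name [12345]_01.01.2022.html"))
print(split_fields("56871_v14.2_Test[64398]_10.30.2019.html"))
```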
There are other random files in the same format.....Files also can change but will be in the same format.
So, assuming the files will indeed be in the same format, we can break this query down into the following requirements:
Change the 1st, the 2nd, and the very last occurrence of the underscore into something we can split on;
Change the dot before the file extension into something we can split on, on the assumption that we don't know whether it will be '.html' or some other extension.
Since you have Microsoft365 we can use dynamic arrays and some basic functions to retrieve what you want:
=LET(X,SEARCH("_??.??.????.",A1),Y,"</s><s>",TRANSPOSE(FILTERXML("<t><s>"&SUBSTITUTE(SUBSTITUTE(REPLACE(A1,X,12,Y&MID(A1,X+1,10)&Y),"_",Y,2),"_",Y,1)&"</s></t>","//s")))
To break this down a little bit:
SEARCH("_??.??.????.",A1) - This part finds the position of the very last underscore, matching up to the dot before the file extension, assuming you don't have any other date in this specific format in your filenames;
SUBSTITUTE() - We can use this function to change specifically the 1st and 2nd instances of the underscore into something we can split on;
FILTERXML() - You may notice we used valid xml start/end-tags to split our data using this function.
TRANSPOSE() - This last function will now spill the returned array over the columns instead of rows.
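The breakdown above can be sketched in Python as follows. This is an approximation under stated assumptions: a regular expression stands in for the SEARCH wildcard pattern (each ? becomes "any one character"), and the function name split_filename is invented for the example. Like the formula, it returns the extension without its leading dot.

```python
import re


def split_filename(name: str) -> list[str]:
    # Equivalent of SEARCH("_??.??.????.", A1): underscore, date-like block, dot.
    match = re.search(r"_(.{2}\..{2}\..{4})\.", name)
    if match is None:
        raise ValueError("no date-like block found")
    head = name[:match.start()]    # everything before the last underscore
    date = match.group(1)          # the 10 characters of the date
    extension = name[match.end():]
    # Split only on the 1st and 2nd underscores; later ones stay in the name.
    first, second, rest = head.split("_", 2)
    return [first, second, rest, date, extension]


print(split_filename("12345_v1.0_TEST_Name [12345]_01.01.2022.html"))
```

Unlike the underscore-counting formulas, this keeps TEST_Name intact because only the first two underscores are treated as separators.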
Without LET():
=TRANSPOSE(FILTERXML("<t><s>"&SUBSTITUTE(SUBSTITUTE(REPLACE(A1,SEARCH("_??.??.????.",A1),12,"</s><s>"&MID(A1,SEARCH("_??.??.????.",A1)+1,10)&"</s><s>"),"_","</s><s>",2),"_","</s><s>",1)&"</s></t>","//s"))
Is this what you are trying to achieve? There may be a more elegant formula for this, but you may try the following as well:
FORMULA USED IN CELL C2
=IF(LEN($B2)-LEN(SUBSTITUTE($B2,"_",""))+1>4,TRIM(MID(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE($B2,".html","_.html"),"_"," ",3),"_",REPT(" ",100)),COLUMN(A2)*99-98,100)),TRIM(MID(SUBSTITUTE(SUBSTITUTE($B2,".html","_.html"),"_",REPT(" ",100)),COLUMN(A2)*99-98,100)))
Fill down and fill across.

reformat excel text column to specific format

I have a column in my Excel sheet that contains authors' names, and it looks as follows:
My goal is to remove the dates + the last comma from all of these rows to make it something like this:
Is there a way I can do it in Excel?
Based on your example, in which there are multiple commas in one cell, I would first determine the position of the last comma (in order to know where to slice the content of the cell). After that it's a matter of an IF formula that only slices when the last 4 characters in the cell are digits:
=IF(ISNUMBER(VALUE(RIGHT(A1,4))),LEFT(A1,FIND("#",SUBSTITUTE(A1,",","#",LEN(A1)-LEN(SUBSTITUTE(A1,",",""))))-1),A1)
FYI: The "#" substitution is targeted at knowing exactly where the last comma occurs in the cell. Any other unique, not-appearing-in-the-string character would have done the same job.
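For readers who want to check the logic outside Excel, here is a small Python sketch of the same idea. The function name and the sample values are made up, since the original examples are not shown in the text. rfind plays the role of the "#" substitution, and isdigit() is only a rough stand-in for ISNUMBER(VALUE(...)).

```python
def strip_trailing_dates(cell: str) -> str:
    # ISNUMBER(VALUE(RIGHT(A1, 4))): do the last four characters form a number?
    if cell[-4:].isdigit():
        last_comma = cell.rfind(",")   # position of the last comma
        if last_comma != -1:           # Excel's FIND would return an error here instead
            return cell[:last_comma]   # LEFT(A1, position - 1)
    return cell


# Hypothetical sample values, not taken from the original question.
print(strip_trailing_dates("Smith, John, 1920-1990"))
print(strip_trailing_dates("Doe, Jane"))
```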
I've tested the formula on the examples below:

Excel formula to remove number + character from text

Is it possible to achieve the following using a formula? I want to get rid of the numbers at the end. Thanks.
This is a possibility with the following assumptions:
The delimiter that separates the number from the rest of the string can be either - or _
There always is a number to be taken off from the string
The formula used in B2:
=LEFT(A2,LEN(A2)-LEN(TRIM(RIGHT(SUBSTITUTE(SUBSTITUTE(A2,"_","-"),"-",REPT(" ",LEN(A2))),LEN(A2))))-1)
Drag down...
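A minimal Python sketch of the same rule, assuming as above that the delimiter is - or _ and that a trailing number is always present. The function name and sample strings are hypothetical.

```python
def drop_trailing_number(text: str) -> str:
    # Position of the last "-" or "_", whichever comes later in the string.
    cut = max(text.rfind("-"), text.rfind("_"))
    # Keep everything before that delimiter; leave the text alone if none exists.
    return text[:cut] if cut != -1 else text


print(drop_trailing_number("abc-def_123"))
print(drop_trailing_number("item-42"))
```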
You need to use the LEFT() function to get rid of text at the end. Sample syntax:
LEFT(cell_id,LEN(cell_id)-num_chars)
For example, if you wanted to remove the last 3 characters from cell A4:
LEFT(A4,LEN(A4)-3)
However, in your case, it looks like you want to get rid of text after the last occurrence of a certain delimiter/separator - that being "-" or "_", so try these two:
LEFT(A4,FIND("#",SUBSTITUTE(A4,"-","#",LEN(A4)-LEN(SUBSTITUTE(A4,"-",""))))-1)
and
LEFT(A4,FIND("#",SUBSTITUTE(A4,"_","#",LEN(A4)-LEN(SUBSTITUTE(A4,"_",""))))-1)

Formula to extract numbers from a text string

How could I extract only the numbers from a text string in Excel or Google Sheets? For example:
A1 - a1b23eg67
A2 - 15dgrgr156
Result desired is
B1 - 12367
B2 - 15156
You can do it with capture groups in Google Sheets
=REGEXREPLACE(A1,"(\d)|.","$1")
Anything which matches the contents of the brackets (a digit) will be copied to the output, anything else replaced by an empty string.
Please see @Max Makhrov's answer to this question
or
=regexreplace(A1,"[^\d]","")
to remove anything which isn't a digit.
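Both patterns can be tried outside Sheets. The sketch below uses Python's re module, which is not the same engine as Google Sheets (RE2), so take it as an illustration of the two ideas rather than an exact equivalent. The function names are made up.

```python
import re


def digits_via_group(text: str) -> str:
    # "(\d)|." : a digit is captured and written back, any other character
    # matches the second branch, where group 1 is empty.
    return re.sub(r"(\d)|.", r"\1", text)


def digits_via_negation(text: str) -> str:
    # "[^\d]" : delete every character that is not a digit.
    return re.sub(r"[^\d]", "", text)


for sample in ("a1b23eg67", "15dgrgr156"):
    print(digits_via_group(sample), digits_via_negation(sample))
```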
Because you asked for Excel also,
If you have an Office 365 subscription, you can use this array formula in Excel:
=--TEXTJOIN("",TRUE,IF(ISNUMBER(--MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1)),MID(A1,ROW(INDIRECT("1:"&LEN(A1))),1),""))
Being an array formula it needs to be confirmed with Ctrl-Shift-Enter instead of Enter when exiting edit mode. If done correctly then Excel will put {} around the formula.
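The array formula walks through the text one character at a time, keeps the characters that are digits, joins them, and converts the result to a number. A rough Python sketch of that loop (with a hypothetical function name) is shown below. If the text holds no digits, int("") raises an error, much as the formula would return an error value.

```python
def digits_as_number(text: str) -> int:
    # MID(A1, ROW(INDIRECT("1:" & LEN(A1))), 1): every single character in turn.
    kept = [ch for ch in text if ch in "0123456789"]   # the ISNUMBER(--MID(...)) test
    # TEXTJOIN("", TRUE, ...) followed by the leading "--" conversion to a number.
    return int("".join(kept))


print(digits_as_number("a1b23eg67"))
print(digits_as_number("15dgrgr156"))
```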
I would imagine there is a way to pull this off with =RegexExtract, but I can't figure out how to get it to repeat the search after the first hit. Often these regex function implementations have a third parameter to repeat the match, but it doesn't look like Google implemented it.
At any rate, the following formula will do the trick. It's just a little roundabout:
=concatenate(SPLIT( LOWER(A1) , "abcdefghijklmnopqrstuvwxyz" ))
This is converting the string to lower case, then splitting the string using any letter of the alphabet. This will return an array of the numbers left over, which we concatenate back together.
Update, switched over to =REGEXREPLACE() instead of extract...:
=regexreplace(A1, "[a-z]", "")
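That's a much cleaner and more obvious way of doing it than that concatenate(split()) nonsense. Note that [a-z] only matches lowercase letters, so wrap A1 in LOWER() or use [a-zA-Z] if the text can contain capitals.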

Locate number string in Excel array?

I have an array of numbers in Excel beginning in B2 as follows:
CA.CAD.CP.0.0.0.0.1.CY
CA.CAD.CP.0.0.0.0.2.CY
CA.CAD.CP.0.0.0.1.0.CY
CA.CAD.CP.0.0.0.2.0.CY
CA.CAD.CP.0.0.3.0.0.CY
CA.CAD.CP.0.0.0.6.0.CY
CA.CAD.OIS.0.0.0.1.0.CY
CA.CAD.OIS.0.0.0.2.0.CY
CA.CAD.OIS.0.0.0.3.0.CY
CA.CAD.OIS.0.0.6.0.0.CY
CA.CAD.OIS.0.0.0.9.0.CY
CA.CAD.OIS.1.0.0.0.0.CY
CA.CAD.ONT.0.0.0.1.0.CY
CA.CAD.ONT.0.0.0.2.0.CY
CA.CAD.ONT.0.0.0.3.0.CY
CA.CAD.ONT.0.0.6.0.0.CY
CA.CAD.ONT.1.0.0.0.0.CY
for several thousand rows. All of them follow this exact format. The numbers represent a date format; D.W.F.M.Y. So 0.0.0.5.0 means 5 months, for example.
I want to find all instances where the date value is "F", meaning all instances of "xx.xxx.xxx.0.0.x.0.0".
What is the best way to do this? I have tried using the FIND function but I think there might be a better way to search for this string.
This returns TRUE/FALSE depending on whether the "F" position (the sixth dot-separated field) is anything other than 0:
=--MID(SUBSTITUTE(B2,".",REPT(" ",99)),5*99,99)<>0
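As a cross-check, here is a short Python sketch of the same test. f_is_nonzero mirrors the formula, which only looks at the F field. only_f_set is an extra, stricter variant matching the pattern from the question (xx.xxx.xxx.0.0.x.0.0) and is not part of the answer above. Both function names are made up.

```python
def f_is_nonzero(code: str) -> bool:
    # The sixth dot-separated field is the "F" slot (index 5 when counting from 0).
    return int(code.split(".")[5]) != 0


def only_f_set(code: str) -> bool:
    # Stricter check: F is non-zero and D, W, M, Y are all zero.
    d, w, f, m, y = code.split(".")[3:8]
    return f != "0" and all(part == "0" for part in (d, w, m, y))


print(f_is_nonzero("CA.CAD.CP.0.0.3.0.0.CY"))
print(f_is_nonzero("CA.CAD.CP.0.0.0.6.0.CY"))
```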
With data starting in A2, in B2 enter:
=TRIM(MID(SUBSTITUTE($A2,".",REPT(" ",999)),COLUMNS($A:A)*999-998,999))
and copy across and then down:
Then set an AutoFilter on column G to display non-zero values.
Have you thought about using Word's Find feature? I understand the data is in Excel, but you can copy and paste it into Word. Its Find capabilities let you search for variable patterns, and even for formatting and special characters such as tabs and punctuation. You can use Find/Replace to mark your text, then copy and Paste Special back into Excel when you are finished. In my experience, Word's Find/Replace capabilities are stronger than those of any other Office program.

Resources