OpenRefine text transform delete all characters from character "[" - text

I want to delete all characters from the "[" character onward, as in the screenshot below:
[Screenshot]
How do I achieve this?

Input examples from the provided screenshot:
189 [122-270]
18 [5.10-6.90]
The easiest way would be to split the column on the separator [ and delete the second column.
The more elaborate way would be to use a regular expression:
value.replace(/\s\[[^\]]+\]/, "")
Play around with https://regex101.com/ or https://regexr.com/ to find out more about how this regular expression works.

If you want to skip the regex complexity, you can use the split function: value.split('[')[0]
The split() function creates an array, and the [0] selector picks its first element. Note that the result keeps the space before the bracket; append .trim() to remove it.
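For readers who want to test the two approaches outside OpenRefine, here is a minimal Python sketch of the same ideas. The function names are made up for illustration, and Python's re module stands in for GREL's regex engine, so treat it as an approximation rather than the exact OpenRefine behaviour.

```python
import re

def strip_bracket_regex(value: str) -> str:
    """Remove a whitespace character followed by a bracketed group, e.g. ' [122-270]'."""
    return re.sub(r"\s\[[^\]]+\]", "", value)

def strip_bracket_split(value: str) -> str:
    """Keep the text before the first '[' and drop the trailing space."""
    return value.split("[")[0].strip()

for sample in ["189 [122-270]", "18 [5.10-6.90]"]:
    print(strip_bracket_regex(sample), strip_bracket_split(sample))
```

Both functions give the same result for the two sample inputs; they differ only when the bracket is not preceded by whitespace or is never closed.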

Related

Replace all non-alphanumeric characters, including wildcards

I took this beautiful formula from JvdV's answer:
=TRIM(CONCAT(IF(ISNUMBER(SEARCH(MID(A1,ROW(A$1:INDEX(A:A,LEN(A1))),1),"-./ 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")),MID(A1,ROW(A$1:INDEX(A:A,LEN(A1))),1)," ")))
This formula replaces any non-alphanumeric character (&^%]#$) with a single space " ".
I added a few exceptions to the formula (-./ and space), but these are not all the exceptions I need.
What about wildcards? How can I filter out the wildcards (~*?) with this formula?
My first thought was to use FIND instead of SEARCH and simply put both the lowercase and uppercase alphabets in the lookup string, like this: "-./ 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
Then I think: But, what if I want to keep not only numeric and regular alphabet? What if I want to keep all diacritics, like this: "ÁÀȦÄǍĀÃÅĄȺẤẦẮẰǠǺǞẪẴẢȀȂẨẲẠḀẬẶĂÂḂɃƁḄḆĆĊĈČÇȻḈƇƆḊĎḐĐƊḌḒḎÐƉÉÈĖÊËĚĔĒẼĘȨɆẾỀḖḔỄḜẺȄȆỂẸḘḚỆÉÈÊËḞƑǴĠĜǦĞḠĢǤƓḢĤḦȞḨĦḤḪⱧÍÌİÏǏĬĪĨĮƗḮỈȈȊỊḬÍÌÏÎȷĴǰḰǨĶƘᶄḲḴⱩꝀꝂꝄĹĿĽⱢⱠĻȽŁḶḼḺḸꝈḾṀṂŃǸṄŇÑŅƝṆṊṈÑŊÓÒȮÔÖǑŎŌÕǪŐỐỒƟØṒṐṌȪỖṎǾȬǬỎȌȎƠỔỌỚỜỠỘỞỢÓÒÔÖÕØṔṖⱣƤƦŔṘŘŖɌⱤȐȒṚṞṜŚṠŜŠṤṦṢṨŞṪŤƬṬƮṰṮȾŢŦÚÙÛÜǓŬŪŨŮŲŰɄǗǛṸṺỦȔȖƯỤṲỨỪṶṴỮỬỰÚÙÛÜṼṾẂẀẆŴẄẈẊẌÝỲẎŶŸȲỸɎỶƳỴÝŹŻẐŽƵẒẔ"
Then the lowercase and uppercase alphabets together are too long for the FIND lookup string.
The SEARCH lookup string would be too long as well, because the function accepts at most 255 characters. But let's say the lookup string has only 200 characters (digits, alphabet and some diacritics).
So the question remains:
How do I filter out (replace with a space) the wildcards (~*?) with this kind of formula?
As I read this question, there are two problems:
How to include more than 255 characters in the 2nd parameter of SEARCH();
How to exclude literal wildcard characters in the 2nd parameter of SEARCH().
One way around the length limit is to feed SEARCH() an array of options, in this case an array of two elements, each shorter than 255 characters:
Formula in C1:
=TRIM(CONCAT(IF(MMULT(IFERROR(SEARCH("~"&MID(A1,ROW(A$1:INDEX(A:A,LEN(A1))),1),{"ÁÀȦÄǍĀÃÅĄȺẤẦẮẰǠǺǞẪẴẢȀȂẨẲẠḀẬẶĂÂḂɃƁḄḆĆĊĈČÇȻḈƇƆḊĎḐĐƊḌḒḎÐƉÉÈĖÊËĚĔĒẼĘȨɆẾỀḖḔỄḜẺȄȆỂẸḘḚỆÉÈÊËḞƑǴĠĜǦĞḠĢǤƓḢĤḦȞḨĦḤḪⱧÍÌİÏǏĬĪĨĮƗḮỈȈȊỊḬÍÌÏÎȷĴǰḰǨĶƘᶄḲḴⱩꝀꝂꝄĹĿĽⱢⱠĻȽŁḶḼḺḸꝈḾṀṂŃǸṄŇÑŅƝṆṊṈÑŊÓÒȮÔÖǑŎŌÕǪŐỐỒƟØṒṐṌȪỖṎǾȬǬỎȌȎƠỔỌỚỜỠỘỞỢÓÒÔÖÕØṔṖⱣƤƦŔṘŘŖɌⱤ";"ȐȒṚṞṜŚṠŜŠṤṦṢṨŞṪŤƬṬƮṰṮȾŢŦÚÙÛÜǓŬŪŨŮŲŰɄǗǛṸṺỦȔȖƯỤṲỨỪṶṴỮỬỰÚÙÛÜṼṾẂẀẆŴẄẈẊẌÝỲẎŶŸȲỸɎỶƳỴÝŹŻẐŽƵẒẔ-./*? 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"}),0),{1,1}),MID(A1,ROW(A$1:INDEX(A:A,LEN(A1))),1)," ")))
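What we did here is:
Use a horizontal array {abc;xyz} of lookup strings to check against our characters, which form a vertical array {a,b,c}. Note the difference between semicolon and comma. Which of the two separates columns and which separates rows depends on Excel's regional settings; in US-English Excel the comma separates columns and the semicolon separates rows, so there the lookup array would be written {abc,xyz} and the last MMULT() argument {1;1}.
The result is a 2D array whose rows MMULT() can sum. If a character is found in either of the two elements, the formula returns that character; otherwise it returns a space.
The wildcard characters * and ? are now also included in the lookup string. An extra tilde, prepended to every character being searched for, escapes them so they are matched literally; leave them out of the lookup string if they should be replaced with a space.
If Excel doesn't recognize all lowercase diacritics as their uppercase counterparts, just add them to one of the two elements. If need be, add a 3rd element, but then you also need to extend the 2nd parameter of MMULT() accordingly.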
Remember, you are using Excel 2019, which means you need to confirm this formula with Ctrl+Shift+Enter (CSE). Needless to say, all of this is much easier in Microsoft 365 with its dynamic array functionality.
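To make the logic of the formula easier to follow, here is a rough Python sketch of the same idea: check each character against an allow-list, replace everything else with a space, then trim. The allow-list is a hypothetical, shortened one (it deliberately omits ~, * and ? so that they are replaced), and the names are invented for this example.

```python
# Hypothetical, shortened allow-list; extend it with whatever should be kept.
# The escapes at the end are the precomposed letters Á À Ä É È.
ALLOWED = "-./ 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\u00c1\u00c0\u00c4\u00c9\u00c8"

# SEARCH() ignores case, so accept both cases of every allowed character.
ALLOWED_SET = set(ALLOWED.upper()) | set(ALLOWED.lower())

def keep_allowed(text: str) -> str:
    """Replace every character outside the allow-list with a space, then trim."""
    replaced = "".join(ch if ch in ALLOWED_SET else " " for ch in text)
    # Like TRIM(): strip both ends and collapse runs of spaces into one.
    return " ".join(replaced.split())

print(keep_allowed("ab*c?d~e&f"))
```

A set lookup has no length limit, so the two-element array trick is only needed on the Excel side.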

Excel: Find words of certain length in string?

I have this file where I want to make a conditional check for any cell that contains the letter combination "_SOL", or where the string is followed by any numeric character like "_SOL1524", and stop looking after that. So I don't want matches for "_SOLUTION" or "_SOLothercharactersthannumeric".
So when I use the following formula, I also get results for words like "_SOLUTION":
=IF(ISNUMBER(FIND("_SOL",A1))=TRUE,"Yay","")
How can I avoid this and only get matches if the match is "_SOL" or "_SOLnumericvalue" (one numeric character)?
Clarification: The whole strings may be "Blabla_SOL_BLABLA", "Blabla_SOLUTION_BLABLA" or "Blabla_SOL1524_BLABLA"
Maybe this, which checks whether the character after "_SOL" is numeric:
=IF(ISNUMBER(VALUE(MID(A1,FIND("_SOL",A1)+4,1))),"Yay","")
Or, as per OP's request and suggestion, to include the possibility of an underscore after "SOL":
=IF(OR(ISNUMBER(VALUE(MID(A1,FIND("_SOL",A1)+4,1))),ISNUMBER(FIND("_SOL_",A1))),"Yay","")
Here is an alternative way to check whether your string contains SOL followed by either nothing or digits only, up to the next underscore:
=IF(COUNT(FILTERXML("<t><s>"&SUBSTITUTE(A1,"_","1</s><s>")&"</s></t>","//s[substring-after(.,'SOL')*0=0]")>0),"Yey","Nay")
This is useful in the unfortunate event that you encounter SOL1TEXT, for example. Or, maybe safer (in case you have text like AEROSOL):
=IF(COUNT(FILTERXML("<t><s>"&SUBSTITUTE(A1,"_","</s><s>")&"</s></t>","//s[translate(.,'1234567890','')='SOL']")>0),"Yey","Nay")
And to guard against text like 123SOL123 you could even do:
=IF(COUNT(FILTERXML("<t><s>"&SUBSTITUTE(A1,"_","1</s><s>")&"</s></t>","//s[starts-with(., 'SOL') and substring(., 4)*0=0]")>0),"Yey","Nay")
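The rule behind the last FILTERXML variant can be sketched in a few lines of Python: split on the underscore and accept a segment only if it is exactly SOL followed by zero or more digits. The function name is made up, and like the FILTERXML versions this sketch also accepts a string that starts with SOL without a leading underscore.

```python
import re

# A segment qualifies if it is "SOL" followed by nothing or by digits only.
SOL_SEGMENT = re.compile(r"SOL[0-9]*")

def has_sol_token(text: str) -> bool:
    """True if any underscore-separated segment is SOL or SOL plus digits."""
    return any(SOL_SEGMENT.fullmatch(part) for part in text.split("_"))

for sample in ["Blabla_SOL_BLABLA", "Blabla_SOLUTION_BLABLA", "Blabla_SOL1524_BLABLA"]:
    print(sample, "Yey" if has_sol_token(sample) else "Nay")
```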

How to make FIND function exact in Excel

I'm using the FIND function in Excel to check whether certain characters appear in a string of characters in a cell.
However, this function doesn't work cleanly for certain special characters. Specifically F̌,B̌, and some others. When F̌ appears in the string, FIND recognizes it as both F and F̌.
Notably, this is not the case for characters such as Ď and Č. FIND works nicely for these.
How can I get the formula to always differentiate between characters with and without the hat? Is there a way to work in EXACT?
Thank you!
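It is because F̌ is actually two characters: the letter F followed by a combining caron (the hat). Ď and Č, by contrast, exist as single precomposed characters.
=LEN("F̌") returns 2, not 1. The second character is the hat.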
If you do:
=UNICHAR(70)&UNICHAR(780)
It returns F̌.
As such, =FIND("F","F̌") returns 1, because F is the first character of a two-character string.
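The same effect can be reproduced outside Excel. This short Python sketch (variable names are illustrative only) shows that F plus the combining caron stays two code points even after Unicode normalization, while D plus the same caron composes into a single character:

```python
import unicodedata

f_caron = "F\u030C"  # F + COMBINING CARON (code point 780 in decimal)
d_caron = "D\u030C"  # D + the same combining caron

print(len(f_caron))        # number of code points in the F sequence
print(f_caron.find("F"))   # a plain search still hits the base letter

# NFC normalization composes sequences that have a precomposed form.
print(len(unicodedata.normalize("NFC", f_caron)))
print(len(unicodedata.normalize("NFC", d_caron)))
```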
To find "F" in A,B,F̌,F use:
=AGGREGATE(15,7,ROW($ZZ1:INDEX($ZZ:$ZZ,LEN(A1)))/((MID(A1,ROW($ZZ1:INDEX($ZZ:$ZZ,LEN(A1))),1)="F")*(MID(A1,ROW($ZZ2:INDEX($ZZ:$ZZ,LEN(A1)+1)),1)<>UNICHAR(780))),1)
To find either one (with A2 holding the character to look for), we need to use IF:
=IF(LEN(A2)=2,FIND(A2,A1),AGGREGATE(15,7,ROW($ZZ$1:INDEX($ZZ:$ZZ,LEN(A1)))/((MID(A1,ROW($ZZ$1:INDEX($ZZ:$ZZ,LEN(A1))),1)=A2)*(MID(A1,ROW($ZZ$2:INDEX($ZZ:$ZZ,LEN(A1)+1)),1)<>UNICHAR(780))),1))
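As a cross-check of the AGGREGATE approach, here is a small Python sketch of the same scan: return the first position where the character occurs and is not followed by a combining mark. The function name is invented, and unlike the formula, which tests for UNICHAR(780) only, this sketch treats any combining mark as a hat.

```python
import unicodedata

def find_bare(text: str, ch: str) -> int:
    """1-based position of the first `ch` not followed by a combining mark, or -1."""
    for i, current in enumerate(text):
        following = text[i + 1:i + 2]  # empty string at the end of the text
        has_mark = bool(following) and unicodedata.combining(following) != 0
        if current == ch and not has_mark:
            return i + 1
    return -1

print(find_bare("A,B,F\u030C,F", "F"))
```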
Given that your substrings are comma-separated, look for the character followed by a comma (and add a comma to the end of the string to find the last character).
This allows you to separate multicharacter substrings from uni-character substrings where the latter is contained in the former.
You could use something like:
=FIND("F,",A5&",")
That will find an F in A5, but will not find an F if only F̌ is present.
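The delimiter trick translates directly to other languages. A minimal Python sketch, with a made-up function name and the comma assumed as the separator:

```python
def find_token(text: str, token: str, sep: str = ",") -> int:
    """1-based position of `token` when it is immediately followed by `sep`, or -1."""
    # Appending the separator lets the last item in the list match as well.
    pos = (text + sep).find(token + sep)
    return pos + 1 if pos >= 0 else -1

print(find_token("A,B,F\u030C,F", "F"))
```

Like the formula, this only checks what follows the token, so a longer item that merely ends in F would also match.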

Excel Remove only last characters if they match

I've been trying a few different ways to search and replace in Excel to remove the last couple of characters.
For instance, in one column I have "product name S".
I want to remove the " S" only.
I have tried some IF formulas as well and have not had much luck. I'm assuming there is a simple regex that can be used for the search and replace, e.g. " S/", that would only replace the characters if they come last and have nothing after them.
Try using the SUBSTITUTE function to replace the letters you want to remove with a unique character, word or space that does not appear anywhere else in the workbook, depending on which part of the string you're trying to remove and what format you're trying to keep.
Then use Find and Replace (Ctrl+H) to replace that word with a blank (space) character.
See how to use the SUBSTITUTE function here:
https://exceljet.net/excel-functions/excel-substitute-function
Since you are only interested in the end of the string, I don't think you need regex or anything too sophisticated.
If I understand correctly, you want the original string (product name S) up to but not including something that appears at the end ( S). In your example, that means the 12 leftmost characters: the length of the original string (14) minus the length of the pattern (2), which gives product name. If the original string does not end with the pattern, you want the original string unchanged.
Therefore, I suggest the following:
=IF(RIGHT("original string",LEN("pattern"))="pattern",
LEFT("original string",LEN("original string")-LEN("pattern")),
"original string")
Check these examples:
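For comparison, the RIGHT/LEFT logic looks like this as a minimal Python sketch. The function name is made up for illustration. One difference to keep in mind: Excel's = comparison on text ignores case, whereas endswith() here is case-sensitive.

```python
def remove_suffix_if_present(original: str, pattern: str) -> str:
    """Drop `pattern` from the end of `original` only if it is really there."""
    if pattern and original.endswith(pattern):
        # Same as LEFT(original, LEN(original) - LEN(pattern)).
        return original[:len(original) - len(pattern)]
    return original

for sample in ["product name S", "product names", "S product"]:
    print(repr(remove_suffix_if_present(sample, " S")))
```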

How to replace wildcharacter in CSV

I have the string below in CSV files:
Part Number WP1166496 (AP6005317) replaces 1166496, 1156976.
Expected output:
Part Number WP1166496 replaces 1166496, 1156976.
I want to replace (AP6005317) with a blank.
There are many rows with different values, so how can I replace the bracketed string with a blank in each of them?
I don't know how to achieve this in Microsoft Excel.
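If you open the Find and Replace feature, you will most probably see an option to use regular expressions.
Enable the regular expression option and replace \(.*\) with a single space. This will solve your problem. If a row can contain more than one bracketed group, use \([^)]*\) instead, because .* is greedy and would match everything from the first ( to the last ).
Note: this was tested and verified in LibreOffice Calc.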
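If the CSV can be processed with a script instead, the same replacement can be sketched in Python. This is only an illustration with an invented function name; it also swallows the whitespace in front of the bracket so that no double space is left behind, which goes slightly beyond the replacement described above.

```python
import re

def drop_parenthesized(text: str) -> str:
    """Remove every '(...)' group together with the whitespace in front of it."""
    return re.sub(r"\s*\([^)]*\)", "", text)

line = "Part Number WP1166496 (AP6005317) replaces 1166496, 1156976."
print(drop_parenthesized(line))
```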
