How can I accomplish this regex? - python-3.x

I want a regex that matches "cell" values from A-J rows to 1-10 columns.
For example it should match A10, A1, E9
It should not match A100, A30, P7, A01
By the time being, I came up with this regex:
(?:[ABCDEFGHIJabcdefghij][123456789](?![123456789]))(?<=1)(0)?
The only case where it fails is when you give it a A100 cell, it matches the first two characters when in reality it should not return a match.
EDIT:
Playing around a little bit, I wrote:
(?<!\S)[ABCDEFGHIJabcdefghij]123456789((?<=1)(0))?(?!\S)
Which seems to work for even most cases. I´m still open to suggestions on how to improve it / write it more elegantly.

You can shorten the pattern using a ranges. Then you could match either 10 or 1-9 using an alternation instead of using (?<=1)(0)? to match 10.
To prevent the partial match, you can use word boundaries.
\b[A-Ja-j](?:10|[1-9])\b
\b A word boundary
[A-Ja-j] Match either chars in a range from A-J or a-j
(?:10|[1-9]) Match either 10 or a single digit 1-9
\b A word boundary
Regex demo
With whitespace boundaries on the left and right:
(?<!\S)[A-Ja-j](?:10|[1-9])(?!\S)

Related

How can I substitute multiple occurrences of junk strings in Excel?

In the image, 'muddle' is the string containing junk words and the strings I want to extract. There is a fixed list of junk words - the good strings could be literally anything.
You can see this formula has correctly extracted "moo" and "coo", which are not in the list of junk words. The formula is below.
=LET(junkStart,FILTER(SEARCH(Table1[junkwords],Table2[muddle]),ISNUMBER(SEARCH(Table1[junkwords],Table2[muddle]))),
junkEnd,FILTER(SEARCH(Table1[junkwords],Table2[muddle])+LEN(Table1[junkwords])-1,ISNUMBER(SEARCH(Table1[junkwords],Table2[muddle])+LEN(Table1[junkwords])-1)),
goodstart,FILTER(junkEnd+1,(junkEnd+1<=LEN(Table2[muddle]))*(ISERROR(XMATCH(junkEnd+1,junkStart)))),
goodend,FILTER(junkStart-1,(junkStart-1>=LEN(1))*(ISERROR(XMATCH(junkStart-1,junkEnd))))+1,
goodchars,goodend-goodstart,
TEXTJOIN("; ",TRUE,MID(Table2[muddle],goodstart,goodchars)))
This works well, but it falls down if a junk word occurs more than once. See below.
The only difference is that 'woo' occurs twice in the second example.
I need a single cell solution. VBA is not an option for me. Using the name manager would be untidy, as would nested formulas.
I've got this far with formulas, which as far as I can tell is the furthest anyone has got with the 'removing multiple words from a cell' problem. I can see the issue - once SEARCH locates the start of a string in a cell, it doesn't go looking for a second occurrence of that string. But I don't know how to find the start of every instance of every string. Can anyone help?
REDUCE is perfect for this:
=REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(m,j,SUBSTITUTE(m,j,"")))
REDUCE starts at the Table2[muddle] value as m then it substitutes the first value of Table1[junkwords] j with "" the outcome becomes the new m which will get a substitute of the second value of j. The result will be the new m, etc.
If you would want to have it comma separated it becomes more complicated, but you can realize by:
=LET(t,SUBSTITUTE(","&REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(x,y,SUBSTITUTE(x,y,",")))&",",",,",","),
MID(t,2,LEN(t)-3))
This does almost the same as the previous solution, but instead of substituting for blanks it substitutes for , and substitutes all duplicate ,, for singles, so if more substitutes followed eachother it results in one comma. Also, if the first and/or last part got substituted by a single ,, then the result would have a leading and/or trailing ,. This is solved by first adding , in the front and back before substituting the double comma's for singles. the result t is then wrapped in MID, where the first and last character (both being a ,) are removed.
Alternate solution:
=LET(t,REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(x,y,SUBSTITUTE(x,y," "))),
SUBSTITUTE(TRIM(t)," ",","))
Or in one go if you don't want to use LET:
=SUBSTITUTE(TRIM(REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(x,y,SUBSTITUTE(x,y," "))))," ",",")
This replaces the junk words with a space. Regardless how many junk words in between words or how many trailing or leading spaces TRIM will fix it to the words separated by one space only. Substituting the spaces for comma gets to your result.
There's no single-formula solution if the junkwords list is not fixed.
Instead, you may choose to use the Substitute() function on each cell of the "Extracted Strings" column to substitute all occurances of each junk word in muddle, i.e. substitute "boo" muddle, then substitute "voo" in the resulted string, replace "noo" in the resulted string...so on. You will get the last cell.
One point to note though, you need to ensure no substring / partial strings problem in the junkwords or you need to define the rules of processing in order for the solution to be "complete". Consider the followings:
junk words = abc, def, cde
muddle = 1234abcdef5678
if you process the string in the above order, you got "12345678"
if you process the junk words in reverse order, you got "123abf5678"

Regex hyphen causing issues

I'm trying to use python to extract invoice info from a text file. The items always start with a quantity (1 x, 2 x, etc) and end with a part number, for various reasons the descriptions can differ even for the same item. The following regex seems to pull out most of the entries, but fails when there is a hyphen '-' in the description.
Regex:
^([0-9]+\sx) [\w\s\r\n®.,]* [0-9]{3}-\d[A-Z]\d+(-AB)?
I'm struggling to work out how to modify the regex to deal with this. I though adding (\s-\s) in would help, but the regex then seems to match parts of different entries rather than the individual ones. I'm sure there are more efficient ways of writing the regex, but I don't use them that often, so getting this far was an achievement. Any pointers gratefully received.
Example entries shown below. With the above regex, the middle one doesn't get matched unless I delete the '-'
1 x Product® 100, Low range items
3,456.00 USD
Cat. No. 012-3A45
1 x Product® 100. Low range items - LRI
123,456.00 USD
Part. no. 201-5G95
2 x Product Mid Range Items
7,654.00 USD
Art. no. 001-8Q147-AB
You can add a hyphen and make the pattern non greedy
^([0-9]+\sx) [-\w\s®.,]*? [0-9]{3}-\d[A-Z]\d+(-AB)?
Regex demo
Note that \s also matches a newline.

Replace all non-alphanumeric characters, including wildcards

I take this beautiful formula from JvdV answer:
=TRIM(CONCAT(IF(ISNUMBER(SEARCH(MID(A1,ROW(A$1:INDEX(A:A,LEN(A1))),1),"-./ 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ")),MID(A1,ROW(A$1:INDEX(A:A,LEN(A1))),1)," ")))
This formula replace any non-alphanumeric character (&^%]#$) with simple space " ".
I put in formula some exception (-./ ), but this is not all exceptions.
How about wildcards? How to filter wildcards (~*?) with this formula?
I think: Ok, I will use FIND instead of SEARCH and all will be right, just put lowercase and uppercase alphabet in the FIND index, like this: *"-./ 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"*
Then I think: But, what if I want to keep not only numeric and regular alphabet? What if I want to keep all diacritics, like this: "ÁÀȦÄǍĀÃÅĄȺẤẦẮẰǠǺǞẪẴẢȀȂẨẲẠḀẬẶĂÂḂɃƁḄḆĆĊĈČÇȻḈƇƆḊĎḐĐƊḌḒḎÐƉÉÈĖÊËĚĔĒẼĘȨɆẾỀḖḔỄḜẺȄȆỂẸḘḚỆÉÈÊËḞƑǴĠĜǦĞḠĢǤƓḢĤḦȞḨĦḤḪⱧÍÌİÏǏĬĪĨĮƗḮỈȈȊỊḬÍÌÏÎȷĴǰḰǨĶƘᶄḲḴⱩꝀꝂꝄĹĿĽⱢⱠĻȽŁḶḼḺḸꝈḾṀṂŃǸṄŇÑŅƝṆṊṈÑŊÓÒȮÔÖǑŎŌÕǪŐỐỒƟØṒṐṌȪỖṎǾȬǬỎȌȎƠỔỌỚỜỠỘỞỢÓÒÔÖÕØṔṖⱣƤƦŔṘŘŖɌⱤȐȒṚṞṜŚṠŜŠṤṦṢṨŞṪŤƬṬƮṰṮȾŢŦÚÙÛÜǓŬŪŨŮŲŰɄǗǛṸṺỦȔȖƯỤṲỨỪṶṴỮỬỰÚÙÛÜṼṾẂẀẆŴẄẈẊẌÝỲẎŶŸȲỸɎỶƳỴÝŹŻẐŽƵẒẔ"
Then lowercase and uppercase alphabet is too much for FIND index.
Ok, for SEARCH index is also too much, because function accept max. 255 length, but lets say we have only 200 characters in index (numbers, alphabet and some diacritics)
So, the question is available:
How to filter (replace with space) wildcards (~*?) with this kind of formula?
As I read this question there are a few problems:
How to include over 255 characters in the 2nd parameter of SEARCH();
How to exclude literal wildcard characters in the 2nd parameter of SEARCH();
One way around the length limit is to feed SEARCH() an array of options, in this case an array of two elements of a lenght of <255:
Formula in C1:
=TRIM(CONCAT(IF(MMULT(IFERROR(SEARCH("~"&MID(A1,ROW(A$1:INDEX(A:A,LEN(A1))),1),{"ÁÀȦÄǍĀÃÅĄȺẤẦẮẰǠǺǞẪẴẢȀȂẨẲẠḀẬẶĂÂḂɃƁḄḆĆĊĈČÇȻḈƇƆḊĎḐĐƊḌḒḎÐƉÉÈĖÊËĚĔĒẼĘȨɆẾỀḖḔỄḜẺȄȆỂẸḘḚỆÉÈÊËḞƑǴĠĜǦĞḠĢǤƓḢĤḦȞḨĦḤḪⱧÍÌİÏǏĬĪĨĮƗḮỈȈȊỊḬÍÌÏÎȷĴǰḰǨĶƘᶄḲḴⱩꝀꝂꝄĹĿĽⱢⱠĻȽŁḶḼḺḸꝈḾṀṂŃǸṄŇÑŅƝṆṊṈÑŊÓÒȮÔÖǑŎŌÕǪŐỐỒƟØṒṐṌȪỖṎǾȬǬỎȌȎƠỔỌỚỜỠỘỞỢÓÒÔÖÕØṔṖⱣƤƦŔṘŘŖɌⱤ";"ȐȒṚṞṜŚṠŜŠṤṦṢṨŞṪŤƬṬƮṰṮȾŢŦÚÙÛÜǓŬŪŨŮŲŰɄǗǛṸṺỦȔȖƯỤṲỨỪṶṴỮỬỰÚÙÛÜṼṾẂẀẆŴẄẈẊẌÝỲẎŶŸȲỸɎỶƳỴÝŹŻẐŽƵẒẔ-./*? 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"}),0),{1,1}),MID(A1,ROW(A$1:INDEX(A:A,LEN(A1))),1)," ")))
What we did here is:
Use an horizontal array {abc;xyz} to check against our characters which was an vertical array {a,b,c}. Note the difference between semi-column and comma.
The result will be a 2D-array which MMULT() can sum. Meaning if the character was found in any of the two elements of the array it will return that same character. Otherwise, a space.
The special wildcard characters are now also included with an extra tilde to escape them as with actually all characters.
If Excel doesn't recognize all lowercase diacritics as their uppercase counterparts, just add them to one of the two elements. If need be, add a 3rd. But know that you'd need to extend on the 2nd parameter in MMULT() too then.
To visualize the above:
Remember, you are using Excel 2019 which means you need to CSE-enter this formula. Needles to say that all will be much easier in ms365 using its dynamic array functionality.

How to make partial match between two strings with mispelled characters

I have a list of 12 digit alphanumeric codes and I need to match against a list of entries where codes might be misspelled.
For example, if the exact code is "K4I3T9OTG9GZ" the entry I have to check might be "K413T90TGS" (1 instead of capital I, 0 instead of capital O, S instead of Z).
I need to do a partial match to be able to find the right code.
Any ideas?
I already tried VLOOKUP with wildcards which worked for most entries with at least five consecutive right characters, but I still have a couple of hundred entries with no match.
Maybe this will help (array formula - Ctrl+Shift+Enter):
=SUMPRODUCT(--ISNUMBER(MATCH(MID($B$2,ROW($A$1:$A$12),1),MID($B$1,ROW($A$1:$A$12),1),0)))
The formula will check each character, one by one, and compare it against the "original"/"exact" code. In your example the result would be 7, as seven characters are matched exactly:
K 4 - 3 T 9 - T G -
{1;1;0;1;1;1;0;1;1;0;0;0}
Here's the full picture:

retrieve part of the info in a cell in EXCEL

I vaguely remember that it is possible to parse the data in a cell and keep only part of the data after setting up certain conditions. But I can't remember what exact commands to use. Any help/suggestion?
For example, A1 contains the following info
0/1:47,45:92:99:1319,0,1320
Is there a way to pick up, say, 0/1 or 1319,0,1320 and remove the rest unchosen data?
I know I can do text-to-column and set the delimiter, followed by manually removing the "un-needed" data, but my EXCEL spreadsheet contains 100 columns X 500000 rows with each cell looking similar to the data above, so I am afraid EXCEL may crash before finishing the work. (have been trying with LEFT, LEN, RIGHT, MID, but none seems to work the way I had hoped)
Any suggestion will be greatly appreciated.
I think what you are looking for is combination of find and mid, but you'll have to work out exactly how you want to split your string:
A1 = 0/1:47,45:92:99:1319,0,1320 //your number
B1 = Find(“:“,A1) //location of first ":" symbol
C1 = LEN(A1) - B1 //character count to copy ( possibly requires +1 or -1 after B1.
=Left(A1,B1) //left of your symbol
=Mid(A1,B1+1,C1) //right size from your symbol (you can also replace C1 with better defined number to extract only 1 portion
//You can also nest the statements to save space, but usually at cost of processing quantity increase
This is the concept, you will probably need to do it in multiple cells to split a string as long as yours. For multiple splits you probably want to replicate this command to target the result of previous right/mid command.
That way, you will get cell result sequence like:
0/1:47,45:92:99:1319,0,1320; 47,45:92:99:1319,0,1320; 92:99:1319,0,1320; 99:1319,0,1320......
From each of those you can retrieve left side of the string up to ":" to get each portion of a string.
If you are working with a large table you probably want to look into VB scripting. To my knowledge there is no single excel command that can take 1 cell and split it into multiple ones.
Let me try to help you about this, I am not a professional so you may face some problems. First of all my solution contains 2 columns to be added to the source column as you can see below. However you can improve formulas with this principle.
Column B Formula:
=LEFT(A2,FIND(":",A2,1)-1)
Column C Formula:
=RIGHT(A2,LEN(A2)-FIND("|",SUBSTITUTE(A2,":","|",LEN(A2)-LEN(SUBSTITUTE(A2,":","")))))
Given you statement of having 100x columns I imagine in some instances you are needing to isolate characters in the middle of your string, thus Left and Right may not always work. However, where possible use them where you can.
Assuming your string is in cell F2: 0/1:47,45:92:99:1319,0,1320
=LEFT(F2,3)
This returns 0/1 which are the first 3 characters in the string counting from the left. Likewise, Right functions similarly:
=RIGHT(F2,4)
This returns 1320, returning the 4 characters starting from the right.
You can use a combination of Mid and Find to dynamically find characters or strings based off of defined characters. Here are a few examples of ways to dynamically isloate values in your string. Keep in mind the key to these examples is the nested Find formula, where the inner most Find is the first character to start at in the string.
1) Return 2 characters after the second : character
In cell F2 I need to isolate the "92":
=MID(F2,FIND(":",F2,FIND(":",F2)+1)+1,2)
The inner most Find locates the first : in the string (4 characters in). We add the +1 to move to the 5th character (moving beyond the first : so the second Find will not see it) and move to the next Find which starts looking for : again from that character. This second Find returns 10, as the second : is the 10th character in the string. The Mid formula takes over here. The formula is saying, Starting at the 10th character return the following 2 characters. Returning two characters is dictated by the 2 at the end of the formula (the last part of the Mid formula).
2) In this case I need to find the 2 characters after the 3rd : in the string. In this case "99":
=MID(F2,FIND(":",F2,FIND(":",F2,FIND(":",F2)+1)+1)+1,2)
You can see we have simply added one more nested Find to the formula in example 1.

Resources