Regexp_replace explanation for PySpark


I have checked the documentation: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.regexp_replace.html
But I cannot for the life of me figure out why this pattern
r'(\d+)'
leads to changing
'100-200'
to
'-----'
Does anyone have good documentation on that? I believe the \d part looks for 0-9, but that's about as far as I get. I also don't understand the order in which things need to be done.

\d matches a digit, i.e. 0-9, and + matches the previous token one or more times, as many times as possible, giving back as needed (greedy).
The column has the value 100-200. On its own, \d would match each digit of 100 separately, but \d+ matches 100 as a whole. So, assuming the replacement string is '--', 100 is replaced by -- and, in the same way, 200 is replaced by --. The hyphen between them is not matched and stays as it is, so the column value ends up as -----.
Parentheses are used to group part of the pattern so that it can be captured and referred to later by index, starting at 1.
Let's say we want to extract only the first matched value in a column; in Spark we can use regexp_extract as shown below:
df.select(regexp_extract('column', r'(\d+)', 1)) # 1 is the group index
In Python, the r prefix before a string literal marks it as a raw string. For example, '\n' is a newline, whereas r'\n' is two characters: a backslash \ followed by n.
If you want to pass the regex token \n and you don't use the r prefix, you have to escape the backslash, like this: "\\n".
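A small sketch of the raw-string point and of group indexing, again using Python's re module rather than Spark itself. The sample value '100-200' comes from the question; re.search(...).group(1) is only an analogy for what regexp_extract with group index 1 returns.

```python
import re

# A normal literal turns \n into a single newline character.
normal_length = len('\n')
# A raw literal keeps both characters: a backslash and the letter n.
raw_length = len(r'\n')
# Escaping the backslash by hand gives the same two characters as the raw form.
same_text = (r'\n' == '\\n')

# Group 1 of the first match, similar in spirit to regexp_extract(col, r'(\d+)', 1).
first_number = re.search(r'(\d+)', '100-200').group(1)

print(normal_length, raw_length, same_text, first_number)  # 1 2 True 100
```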
You can practice and test regexes on this website, where you get a real-time explanation of what is happening in the background. You can also go through this simple cheatsheet.

Related

How can I substitute multiple occurrences of junk strings in Excel?

In the image, 'muddle' is the string containing junk words and the strings I want to extract. There is a fixed list of junk words - the good strings could be literally anything.
You can see this formula has correctly extracted "moo" and "coo", which are not in the list of junk words. The formula is below.
=LET(junkStart,FILTER(SEARCH(Table1[junkwords],Table2[muddle]),ISNUMBER(SEARCH(Table1[junkwords],Table2[muddle]))),
junkEnd,FILTER(SEARCH(Table1[junkwords],Table2[muddle])+LEN(Table1[junkwords])-1,ISNUMBER(SEARCH(Table1[junkwords],Table2[muddle])+LEN(Table1[junkwords])-1)),
goodstart,FILTER(junkEnd+1,(junkEnd+1<=LEN(Table2[muddle]))*(ISERROR(XMATCH(junkEnd+1,junkStart)))),
goodend,FILTER(junkStart-1,(junkStart-1>=LEN(1))*(ISERROR(XMATCH(junkStart-1,junkEnd))))+1,
goodchars,goodend-goodstart,
TEXTJOIN("; ",TRUE,MID(Table2[muddle],goodstart,goodchars)))
This works well, but it falls down if a junk word occurs more than once. See below.
The only difference is that 'woo' occurs twice in the second example.
I need a single cell solution. VBA is not an option for me. Using the name manager would be untidy, as would nested formulas.
I've got this far with formulas, which as far as I can tell is the furthest anyone has got with the 'removing multiple words from a cell' problem. I can see the issue - once SEARCH locates the start of a string in a cell, it doesn't go looking for a second occurrence of that string. But I don't know how to find the start of every instance of every string. Can anyone help?

REDUCE is perfect for this:
=REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(m,j,SUBSTITUTE(m,j,"")))
REDUCE starts with the Table2[muddle] value as m. It then substitutes the first value of Table1[junkwords], j, with "". The outcome becomes the new m, in which the second value of j gets substituted. That result becomes the new m, and so on.
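For readers who find the accumulator easier to follow in code, here is a rough Python analogue of the REDUCE/LAMBDA formula using functools.reduce. It is not Excel and the sample values are hypothetical; it only illustrates how the running result m is fed into each successive substitution, and that str.replace (like SUBSTITUTE) removes every occurrence of a junk word.

```python
from functools import reduce

def remove_junk(muddle, junk_words):
    # m is the running result; j is the current junk word.
    # Each step removes all occurrences of j from m and passes the result on.
    return reduce(lambda m, j: m.replace(j, ''), junk_words, muddle)

# Hypothetical sample data: 'woo' occurs twice and both occurrences are removed.
cleaned = remove_junk('boomoowoocoowoo', ['boo', 'woo'])
print(cleaned)  # moocoo
```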
If you want the result comma-separated it becomes more complicated, but you can achieve it with:
=LET(t,SUBSTITUTE(","&REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(x,y,SUBSTITUTE(x,y,",")))&",",",,",","),
MID(t,2,LEN(t)-2))
This does almost the same as the previous solution, but instead of substituting blanks it substitutes a comma, and then replaces each double comma (,,) with a single one, so that substitutions that follow each other result in one comma. Also, if the first and/or last part got substituted, the result would have a leading and/or trailing comma. This is solved by first adding a comma at the front and back before replacing the double commas with singles. The result t is then wrapped in MID, which removes the first and last character (both being a comma).
Alternate solution:
=LET(t,REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(x,y,SUBSTITUTE(x,y," "))),
SUBSTITUTE(TRIM(t)," ",","))
Or in one go if you don't want to use LET:
=SUBSTITUTE(TRIM(REDUCE(Table2[muddle],Table1[junkwords],LAMBDA(x,y,SUBSTITUTE(x,y," "))))," ",",")
This replaces the junk words with a space. Regardless of how many junk words sit between the good words, or how many leading or trailing spaces result, TRIM reduces it to the words separated by a single space. Substituting a comma for each space gets you to your result.
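The space-then-TRIM idea can be sketched in Python as well. This is an analogue with hypothetical sample data, not a translation of the formula: ' '.join(text.split()) stands in for Excel's TRIM, which drops leading and trailing spaces and collapses runs of spaces into one.

```python
from functools import reduce

def extract_good_strings(muddle, junk_words):
    # Step 1: replace every junk word with a single space.
    spaced = reduce(lambda text, junk: text.replace(junk, ' '), junk_words, muddle)
    # Step 2: collapse repeated spaces and strip the ends (the role TRIM plays).
    trimmed = ' '.join(spaced.split())
    # Step 3: turn the remaining single spaces into commas.
    return trimmed.replace(' ', ',')

result = extract_good_strings('boomoowoocoowoo', ['boo', 'woo'])
print(result)  # moo,coo
```

As with the formula, this only works if the good strings themselves contain no spaces.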

There's no single-formula solution if the junkwords list is not fixed.
Instead, you may choose to use the SUBSTITUTE() function on each cell of the "Extracted Strings" column to substitute all occurrences of each junk word in muddle: substitute "boo" in muddle, then substitute "voo" in the resulting string, then "noo" in that result, and so on. The last cell in the chain holds the final result.
One point to note, though: you need to make sure the junk words do not overlap as substrings or partial strings of one another, or else define the order of processing, for the solution to be "complete". Consider the following:
junk words = abc, def, cde
muddle = 1234abcdef5678
if you process the junk words in the above order, you get "12345678"
if you process the junk words in reverse order, you get "1234abf5678"
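The order dependence in this example is easy to check with a short Python sketch. The helper name strip_junk is hypothetical; it simply applies the removals one after another, as the chained SUBSTITUTE calls would.

```python
from functools import reduce

def strip_junk(muddle, junk_words):
    # Remove each junk word in the order given, reusing the previous result.
    return reduce(lambda text, junk: text.replace(junk, ''), junk_words, muddle)

junk = ['abc', 'def', 'cde']
forward = strip_junk('1234abcdef5678', junk)
backward = strip_junk('1234abcdef5678', list(reversed(junk)))

print(forward)   # 12345678
print(backward)  # 1234abf5678
```

In reverse order 'cde' is removed first, which breaks up both 'abc' and 'def', so neither of them is found afterwards.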

What do you understand by this RegEx?

I'm working with VBA and trying to split a string into three columns; almost all strings are like Company Name 3567782 Agent Name.pdf
With this pattern I want to match all the text before a space and digits (1st group), the digits (2nd group) and all the text after the space and before the .pdf (3rd group).
strPattern = "^(.+)\n(\d{4,10})\n(.+).pdf"
I recall spaces in python are \s but saw in VBA are \n.
Can you help me find the right pattern for what I'm looking for?

As I put in my comment, I use the https://regex101.com site. There are others but I find this one the most helpful to me.
When I put in your regex
^(.+)\n(\d{4,10})\n(.+).pdf
and test string
Company Name 3567782 Agent Name.pdf
the first thing I notice is that the regex does not match the test string (see right side under MATCH INFORMATION).
Here are a couple things that I saw:
\n is newline, not space. In regex, space is " ".
Your last "." in ".pdf" is not registering as a literal period, it's a token that matches any character. To match a literal period, you need \.
If we change those two things it returns three groups that seem to match what you are looking for.
^(.+) (\d{4,10}) (.+)\.pdf
It looks like for the digits, you are looking for between 4 and 10 digits. If that's correct, it looks like your regex is good. You could put in a handful of example strings into the TEST STRING area and make sure that it works in all cases.
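As a quick check outside regex101, the corrected pattern can be run against the sample string. This sketch uses Python's re module rather than the VBA RegExp object, on the assumption that the two engines treat this particular pattern the same way.

```python
import re

# Literal spaces instead of \n, and an escaped period before "pdf".
pattern = re.compile(r'^(.+) (\d{4,10}) (.+)\.pdf')

match = pattern.match('Company Name 3567782 Agent Name.pdf')
company, number, agent = match.groups()

print(company, number, agent, sep=' | ')  # Company Name | 3567782 | Agent Name
```

The first (.+) is greedy, but it has to give characters back until a space followed by 4 to 10 digits and another space can be matched, which is why it stops at "Company Name".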

I'd use either of these:
(?:(?:([a-zA-Z]+\.?)|(\d+)))
This captures letters (a-z, A-Z) greedily, with an optional trailing . to allow for the .pdf, or captures digits.
This version excludes the space ([ ] or \s).
Or keep the search structured, so you can control what goes in and out of each column:
^(\w+\s\w+)|(\d+)|(\w+\s\w+\.\w+$)
^ - start of string (\b, a word boundary, could be used instead)
(\w+\s\w+) - 1st capture: \w+ matches alphanumeric characters greedily, followed by one whitespace character (use \s* or \s+ for more), followed again by alphanumeric characters greedily
|(\d+) - alternation: \d+ captures just the digits
|(\w+\s\w+\.\w+$) - similar to the 1st group, but allows for the '.' of .pdf and is anchored to the end of the string with $
You could optionally build the '.' into the 1st group as in my first pattern, but for neatness and better control I prefer the 2nd.
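To see what the structured pattern yields, here is a sketch in Python (again standing in for the VBA RegExp object). Because the pattern is an alternation, it produces three separate matches rather than one match with three groups, so each match fills exactly one of the capture groups.

```python
import re

pattern = re.compile(r'^(\w+\s\w+)|(\d+)|(\w+\s\w+\.\w+$)')

# findall returns one tuple per match; only one group in each tuple is non-empty.
parts = pattern.findall('Company Name 3567782 Agent Name.pdf')
columns = [next(group for group in groups if group) for groups in parts]

print(columns)  # ['Company Name', '3567782', 'Agent Name.pdf']
```

Note that with this pattern the third value still includes the .pdf extension.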

VIM line count in status bar with thousands separator?

Is it possible to display the line count in the VIM status bar with thousands separators, preferably custom thousands separators?
Example:
set statusline=%L
should lead to "1,234,567" instead of "1234567".

I've found a way but it looks a bit crazy:
set statusline=%{substitute(line('$')\,'\\d\\zs\\ze\\%(\\d\\d\\d\\)\\+$'\,'\,'\,'g')}
The first round of backslashes is just for set (I have to escape , and \ itself).
What I'm actually setting the option to is this string:
%{substitute(line('$'),'\d\zs\ze\%(\d\d\d\)\+$',',','g')}
As a format string, this line contains one formatting code, which is %{...}. Everything in ... is evaluated as an expression and the result substituted back in.
The expression I'm evaluating is (spaces added (if I had added them to the real code, I would've had to escape them for set again, forcing yet more backslashes)):
substitute(line('$'), '\d\zs\ze\%(\d\d\d\)\+$', ',', 'g')
This is a call to the substitute function. The arguments are the source string, the regex, the replacement string, and a list of flags.
The string we're starting with is line('$'). This call returns the number of lines in the current buffer (or rather the number of the last line in the buffer). This is what %L normally shows.
The search pattern we're looking for is \d(\d\d\d)+$ (special vim craziness removed), i.e. a digit followed by 1 or more groups of 3 digits, followed by the end of the string. Grouping is spelled \%( \) in vim, and "1 or more" is \+, which gives us \d\%(\d\d\d\)\+$. The last bit of magic is \zs\ze. \zs sets the start of the matched string; \ze sets the end. This works as if everything before \zs were a look-behind pattern and everything after \ze were a look-ahead pattern.
What this amounts to is: We're looking for every position in the source string that is preceded by a digit and followed by exactly N digits (where N is a multiple of 3). This works like starting at the right and going left, skipping 3 digits each time. These are the positions where we need to insert a comma.
That's what the replacement string is: ',' (a comma). Because we're matching a string of length 0, we're effectively inserting into the source string (by replacing '' with ',').
Finally, the g flag says to do this with all matches, not just the first one.
TL;DR:
line('$') gives us the number of lines
substitute(..., '\d\zs\ze\%(\d\d\d\)\+$', ',', 'g') adds commas where we want them
%{ } lets us embed arbitrary expressions into statusline
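The same zero-width trick can be sketched in Python, which may help when reading the Vim pattern. The look-behind (?<=\d) plays the role of \d\zs and the look-ahead (?=(?:\d{3})+$) plays the role of \ze\%(\d\d\d\)\+$. The function name and the sep parameter are hypothetical; this is an illustration of the regex, not something Vim runs.

```python
import re

def add_thousands_separators(line_count, sep=','):
    # Match every empty position that has a digit before it and a
    # multiple of three digits between it and the end of the string.
    # A function is used as the replacement so sep is inserted literally.
    return re.sub(r'(?<=\d)(?=(?:\d{3})+$)', lambda _match: sep, str(line_count))

print(add_thousands_separators(1234567))       # 1,234,567
print(add_thousands_separators(1234567, "'"))  # 1'234'567
```

A custom separator in the Vim version is the third argument to substitute(), just as sep is here.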

Correlate the Boundary value in Load Runner 12.5

I am using LoadRunner 12.5. From the values below I need to correlate and capture 1aqeid!None (the None part may also be filled with numbers, so it's dynamic).
Example:
1. {id:'1aqeid!None!123456',paramName:'jsessionId'};
2. {id:'1aqeid!zxsjfn12536782ldfj!123456',paramName:'jsessionId'};
I need to get only the below value
1. 1aqeid!None
2. 1aqeid!zxsjfn12536782ldfj
web_reg_save_param("ID","LB=id:'","RB=!","ORD=1",LAST);
I am not able to find the solution.

web_reg_save_param("ID",
"LB={id:'",
"RB=',paramName:'jsessionId'",
"ORD=ALL",
LAST);
This will leave you with:
1aqeid!None!{some value you do not need}
You have a number of options at this point. You could use strtok() to split the string with '!' as the separator, or you could scan the character array for the second '!' and use strncpy() to copy the substring before it as your value, among other approaches. The point here is that you can collect more than you need and then trim it down based on a known separator in the data.
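The trimming step can be sketched as follows. This is Python for illustration only, since a LoadRunner script would do the equivalent in C with strtok() or strncpy(); the function name is hypothetical and the inputs are the two sample values from the question.

```python
def trim_at_second_separator(captured, sep='!'):
    # Find the first separator, then the next one after it.
    first = captured.find(sep)
    second = captured.find(sep, first + 1) if first != -1 else -1
    # Keep everything before the second separator; if there is none,
    # return the captured value unchanged.
    return captured if second == -1 else captured[:second]

print(trim_at_second_separator('1aqeid!None!123456'))                # 1aqeid!None
print(trim_at_second_separator('1aqeid!zxsjfn12536782ldfj!123456'))  # 1aqeid!zxsjfn12536782ldfj
```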

Tell vim to add commas to a number, e.g. change 31415926 to 31,415,926

I have a very large number (a couple hundred digits long), and I'd like to use vim to add commas to the number in the appropriate manner, i.e. after each group of three digits, moving from right to left. How can I do this efficiently?

Taken from here
A substitute command that adds commas in the right spots:
:%s/\(\d\)\(\(\d\d\d\)\+\d\@!\)\@=/\1,/g
This uses a zero-width lookahead (\@=) to match any digit that is followed by one or more groups of three digits which are not, in turn, followed by another digit (\@!). In other words, it matches a digit only when exactly 3n digits follow it.
The digits that match are marked with ^ below. Each is then replaced by itself followed by a comma.
31415926
 ^  ^
Which becomes
31,415,926
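For comparison, the same lookahead logic can be written with Python's re module. This is a sketch of the pattern, not a Vim command: (?=...) corresponds to Vim's positive lookahead and (?!\d) to the negative one.

```python
import re

def add_commas(digits):
    # Capture a digit that is followed by one or more groups of three
    # digits, where the last group is not followed by yet another digit.
    # The captured digit is put back, followed by a comma.
    return re.sub(r'(\d)(?=(?:\d{3})+(?!\d))', r'\1,', digits)

print(add_commas('31415926'))  # 31,415,926
```

Since the input is handled as text, the number can be arbitrarily long.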

A friend of mine suggests using the printf program: ciw<C-r>=system("printf \"%'d\" ".shellescape(@"))<CR>.

This is one way of doing it:
s/\d\{-1,}\ze\(\d\{3}\)\+\>/&,/g
Notes:
\{-1,} says match at least 1, but in a non-greedy way (Vim doesn't seem to support the usual \+\? syntax; also, for quantifiers, you only need to escape the opening curly brace)
\ze ends the match here: the pattern after it must still match, but it is not included in & (similar to a positive look-ahead)
\(\d\{3}\)\+\> matches groups of 3 digits that end at a word/non-word boundary (word in this sense means alphanumeric characters plus underscore).
Alternatively, you can use \s for space/tab, or \D for non-digit, instead of \>, whichever fits your needs better

The way I did it was to create a macro that adds a single comma and then invoke the macro a whole bunch of times, like qahhi,<ESC>hq@a@a@a@a…
