I need to do some tricky string processing. How can I dynamically wrap each /.../ token in parentheses, as in the following example?
Example: /hello/ baby /deneme/ /hello2/
Output: (/hello/) baby (/deneme/) (/hello2/)
This is a pretty rudimentary solution, but it works for the case you've given (SQL Fiddle here):
SELECT
    in_str,
    (
        -- If the string starts with '/', prepend '('
        CASE WHEN in_str LIKE '/%' THEN '(' ELSE '' END
        -- Outer REPLACE: turn each / followed by a space into /)
        + REPLACE(
            -- Inner REPLACE: turn each / preceded by a space into (/
            REPLACE( in_str, ' /', ' (/' ),
            '/ ', '/) '
        )
        -- If the string ends with '/', append ')'
        + CASE WHEN in_str LIKE '%/' THEN ')' ELSE '' END
    ) AS out_str
FROM table1;
If table1 has the following in_str values this will give you the corresponding out_str values:
in_str                    out_str
------------------------  ------------------------------
/one/ two /three/ /four/  (/one/) two (/three/) (/four/)
one /two/ /three/         one (/two/) (/three/)
/one/ /two/ three         (/one/) (/two/) three
//one / // two/ /         (//one (/) (//) two/) (/)
I've included the last one to demonstrate some edge cases. Also note that this only handles / characters that are immediately preceded or followed by a space, or that sit at the very beginning or end of the string. Other whitespace characters like newlines and tabs aren't handled. For example, if you had a string like this (where ⏎ indicates a newline and ⇒ a tab):
/one/⇒/two/⏎
/three/⏎
...the output you would get is this:
(/one/⇒/two/⏎
/three/⏎
You could handle these scenarios with additional REPLACE functions, but that's a rabbit hole you'll have to jump down yourself.
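For anyone who wants to experiment outside the database, here is a minimal Python sketch of the same idea. It is not part of the SQL solution: the function names wrap_slashes and wrap_slashes_ws are made up for illustration. The first function is a literal port of the REPLACE chain, and the second is one possible way, using lookarounds, to treat tabs and newlines like spaces.

```python
import re


def wrap_slashes(in_str: str) -> str:
    """Literal port of the SQL above: only spaces and string ends count."""
    out = in_str.replace(" /", " (/").replace("/ ", "/) ")
    if in_str.startswith("/"):
        out = "(" + out
    if in_str.endswith("/"):
        out = out + ")"
    return out


def wrap_slashes_ws(in_str: str) -> str:
    """Variant (an assumption, not the original approach): any whitespace counts."""
    # A '/' preceded by whitespace or the start of the string gets '(' in front.
    out = re.sub(r"(?<!\S)/", "(/", in_str)
    # A '/' followed by whitespace or the end of the string gets ')' after it.
    return re.sub(r"/(?!\S)", "/)", out)


print(wrap_slashes("/one/ two /three/ /four/"))  # (/one/) two (/three/) (/four/)
print(repr(wrap_slashes_ws("/one/\t/two/\n/three/\n")))
```

On space-separated input both functions agree, including on the edge-case row in the table above.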
I have already done some research and found out about catastrophic backtracking, but I can't figure out whether that is what is happening here.
I have a small script:
import re
if __name__ == '__main__':
    name = 'vuejs-complete-guide-vue-course.vue.test'
    print( name )
    extractedDomain = re.findall(r'([A-Za-z0-9\-\_]+){1,63}.([A-Za-z0-9\-\_]+){1,63}$', name)
    print( extractedDomain )
This regex never finishes, and I don't understand why.
But if the name is:
name = 'vue-course.vue.test'
Then it works.
Can someone help me?
The issue is catastrophic backtracking caused by the nested quantifiers: the + on the character class sits inside a group that is itself repeated with {1,63}.
The character class does not contain a dot, so the dots in your string can only be matched by the unescaped . in your pattern (which, being unescaped, can also match any other character).
Your string contains 2 dots but the pattern can consume only one of them, so an attempt that starts before the first dot cannot succeed. Before giving up on it, the engine still tries to explore all the paths through the nested quantifiers.
A string that ends in a dot, like vuejs-complete., can also be problematic, because the pattern requires at least one character from the class after the dot.
Looking at the pattern you tried and the example string, you can instead repeat the character class 1-63 times, followed by a group that starts with a dot and is repeated 1 or more times.
Note that the dot must be escaped to match it literally.
^[A-Za-z0-9_-]{1,63}(?:\.[A-Za-z0-9_-]{1,63})+$
Explanation
^ Start of string
[A-Za-z0-9_-]{1,63} Repeat the character class 1-63 times
(?: Non capture group to repeat as a whole part
\.[A-Za-z0-9_-]{1,63} Match . and repeat the character class 1-63 times
)+ Close the group and repeat 1+ times
$ End of string
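As a quick check, the rewritten pattern can be dropped into a small script like the one below. This is only a sketch: the helper name is_domain_like is invented for illustration, and the pattern is used for validation rather than with findall.

```python
import re

# Each label is 1-63 allowed characters; labels are separated by literal dots.
DOMAIN_RE = re.compile(r'^[A-Za-z0-9_-]{1,63}(?:\.[A-Za-z0-9_-]{1,63})+$')


def is_domain_like(name):
    """Return True if name consists of two or more dot-separated labels."""
    return DOMAIN_RE.match(name) is not None


print(is_domain_like('vuejs-complete-guide-vue-course.vue.test'))  # True
print(is_domain_like('vuejs-complete.'))                           # False
```

Because no quantified group contains another unbounded quantifier over the same characters, a failing input is rejected quickly instead of triggering the exponential search.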
I'm using python's re module to grab all instances of values between the
opening and closing parenthesis.
i.e. (A)way(Of)testing(This)
would produce a list:
['A', 'Of', 'This']
I took a look at 1 and 2.
This is my code:
import re
sentence = "(A)way(Of)testing(This)is running (it)"
res = re.compile(r".*\(([a-zA-Z0-9|^)])\).*", re.S)
for s in re.findall(res, sentence):
    print(s)
What I get from this is:
it
Then I realized I was capturing only one character, so I used
res = re.compile(r".*\(([a-zA-Z0-9-|^)]*)\).*", re.S)
But I still get it
I've always struggled with regex. My understanding of my search string
is as follows:
.* (any character)
\( (escapes the opening parenthesis)
( (starts the grouping)
[a-zA-Z0-9-|^)]* (set of characters allowed : a-z, A-Z, 0-9, - *EXCEPT the ")" )
) (closes the grouping)
\) (escapes the closing parenthesis)
.* (anything else)
So in theory it should go through sentence and once it encounters a (,
it should copy the contents up until it encounters a ), at which point it should
store that into one group. It then proceeds through the sentence.
I even used the following:
res = re.compile(r".*\(([a-z|A-Z|0-9|-|^)]*)\).*", re.S)
But it still returns an it.
Any help greatly appreciated,
Thanks
You can shorten the pattern by dropping the .* parts and the |^) from the character class, keeping only the characters you want to match.
The leading .* is greedy: it first consumes the whole string and then backtracks only as far as needed, so the entire sentence becomes a single match and the group captures just the last parenthesized value the class can match. Because the capture group appears only once in the pattern, each match yields only one value.
In your explanation of the part [a-zA-Z0-9-|^)]*, the |^) does not exclude the ) character. Inside a character class those are literal characters, so the class simply also matches a |, a ^ or a ).
If you want to use a negated character class, the ^ has to come first, right after the opening bracket, as in [^)]. That is not necessary here, though, as you can specify what you want to match instead of what you don't want to match.
\(([a-zA-Z0-9-]*)\)
The pattern matches:
\( Match (
( Capture group 1
[a-zA-Z0-9-]* Optionally repeat matching one of the listed ranges a-zA-Z0-9 or -
) Close group 1
\) Match )
You don't need the re.S as there is no dot in the pattern that should match a newline.
import re
sentence = "(A)way(Of)testing(This)is running (it)"
res = re.compile(r"\(([a-zA-Z0-9-]*)\)")
print(re.findall(res, sentence))
Output
['A', 'Of', 'This', 'it']
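To see the effect of the surrounding .* side by side with a negated character class, here is a small sketch. The variable names greedy and negated are just for illustration, and the [^)]* variant is an alternative to the pattern above, not a replacement for it.

```python
import re

sentence = "(A)way(Of)testing(This)is running (it)"

# Greedy .* on both sides: one match spans the whole sentence,
# so only the last parenthesized value is captured.
greedy = re.findall(r".*\(([a-zA-Z0-9-]*)\).*", sentence, re.S)

# Negated character class: capture everything up to the next ')'.
negated = re.findall(r"\(([^)]*)\)", sentence)

print(greedy)   # ['it']
print(negated)  # ['A', 'Of', 'This', 'it']
```

The negated class accepts any content between the parentheses, so it is the looser of the two; the explicit class is preferable when only letters, digits and hyphens should count.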
I have a string in the following format
----------some text-------------
How do I extract "some text" without the hyphens?
I have tried with this but it matches the whole string
(?<=-).*(?=-)
The pattern matches the whole line except the first and last hyphen: the assertions only require a hyphen on the left and on the right, and the . also matches hyphens, so .* runs across all the hyphens in between.
You can keep the assertions and match any character except a hyphen using [^-]+, which is a negated character class.
(?<=-)[^-]+(?=-)
Note: if you also want to prevent matching a newline you can use (?<=-)[^-\r\n]+(?=-)
With your shown samples, please try the following regex.
^-+([^-]*)-+$
Explanation: Match 1 or more dashes at the start of the string, then capture everything up to the next - in the pattern's only capturing group, then match the dashes that run to the end of the string.
You are using Python, right? Then you do not need regex.
Use
s = "----------some text-------------"
print(s.strip("-"))
Results: some text.
With regex, the same can be achieved using
re.sub(r'^-+|-+$', '', s)
EXPLANATION
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  -+                       '-' (1 or more times (matching the most
                           amount possible))
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  -+                       '-' (1 or more times (matching the most
                           amount possible))
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
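For comparison, the approaches suggested in this thread can be run side by side. This is only a sketch; the variable names are made up for illustration.

```python
import re

s = "----------some text-------------"

# Lookarounds with a negated class: text between hyphens.
by_lookaround = re.findall(r"(?<=-)[^-]+(?=-)", s)

# Anchored pattern with one capturing group.
m = re.match(r"^-+([^-]*)-+$", s)
by_group = m.group(1) if m else None

# No regex at all: strip hyphens from both ends.
by_strip = s.strip("-")

# Regex equivalent of strip: remove leading and trailing runs of hyphens.
by_sub = re.sub(r"^-+|-+$", "", s)

print(by_lookaround)  # ['some text']
print(by_group)       # some text
print(by_strip)       # some text
print(by_sub)         # some text
```

The results only diverge when the text itself contains hyphens: strip and re.sub keep interior hyphens, the lookaround version splits on them, and the anchored pattern does not match at all.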
Why does the + in a regular expression lookahead seem to have no effect?
This concerns a lookahead pattern in Python 3.
The details are as follows.
My goal is to match the dot and any number of adjacent digits to its left, so that I can extract the part that was not matched. For example:
"Contents156.html" -> "Contents";
"PingHang12Report_ipad1_1269.html" -> "PingHang12Report_ipad1_";
But the pattern doesn't seem to work, apparently because "Lookaround Is Atomic". What should I do?
You are using ?=, which "matches next but doesn't consume any of the string". Your .* is greedy, so it matches as much as it can (including 2 of the digits), and the ?= part only needs to find a single digit and a dot as the "next" part. Things matched by ?= will not appear in the final result.
If you need a non-greedy match for the .* part, use .*? instead.
re.findall(r'.*?(?=\d+\.)', 'PingHang12Report_ipad1_1269.html')
# => ['PingHang12Report_ipad1_', '', '', '', '']
where you can just take the first element.
Another way to do this,
re.findall(r'(.*?)(\d+\..*)', 'PingHang12Report_ipad1_1269.html')
# => [('PingHang12Report_ipad1_', '1269.html')]
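The difference between the greedy and the non-greedy prefix can be checked with a short script. This is a sketch only: the helper name strip_numeric_suffix is invented here, and it uses re.match with a capturing group instead of findall so that only one result comes back.

```python
import re


def strip_numeric_suffix(filename):
    """Return the part of filename before the digits that precede the first matching dot."""
    # (.*?) expands one character at a time until \d+\. can match right after it.
    m = re.match(r"(.*?)\d+\.", filename)
    return m.group(1) if m else filename


# Greedy .* leaves only a single digit for the lookahead to find.
greedy = re.match(r".*(?=\d+\.)", "Contents156.html").group()

print(greedy)                                                    # Contents15
print(strip_numeric_suffix("Contents156.html"))                  # Contents
print(strip_numeric_suffix("PingHang12Report_ipad1_1269.html"))  # PingHang12Report_ipad1_
```

Names without a digits-then-dot part are returned unchanged, which is an assumption about the desired fallback rather than something stated in the question.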
What is an efficient way in MATLAB to replace/insert one symbol (in series of symbols) with several others that correspond to the one that is being replaced?
For example, consider having a string Eq: Eq = 'A*exp(-((x-xc)/w)^2)'. Is there a way to replace * with .*, / with ./, \ with .\, and ^ with .^ without writing four separate strrep() lines?
Regular expressions will do the job nicely. Regular expressions simply find patterns in text. You specify what kind of pattern you are looking for by a regular expression, and the output gives you the locations of where the pattern occurred.
For our particular case, not only do we want to find where patterns occur, we also want to replace those patterns with something else. Specifically, use the function regexprep from MATLAB to replace matches in a string with something else. What you want to do is replace all *, /, \ and ^ symbols by adding a . in front of each.
The first input to regexprep is the string you're looking at, and the second input is the pattern you're trying to find. In our case, we want to find any of *, /, \ and ^. To specify this pattern, you put the desired symbols in [] brackets, which form a character class. Regular expressions reserve \ as the escape character: it marks a character that would otherwise have a special meaning as a literal one. As such, you need to use \\ for the \ character and \^ for the ^ character.
The third input is what you want to replace each match with. In our case, we simply want to reuse each matched character, but add a . in front of it. This is done with \.$0 in the replacement syntax. $0 refers to the entire match, which here is the single matched symbol. Because . is a metacharacter in regular expressions, it is written here with a \ in front so that it is treated as a literal dot.
Without further ado:
>> Eq = 'A*exp(-((x-xc)/w)^2)';
>> out = regexprep(Eq, '[*/\\\^]', '\.$0')
out =
A.*exp(-((x-xc)./w).^2)
The pattern we are looking for is [*/\\\^], which means that we want to find any of *, /, \ (written as \\ in the regex) and ^ (written as \^ in the regex). Each match is replaced with the same symbol preceded by a . character, which is what \.$0 does.
As a more complicated example, let's make sure that we include all of the symbols you're looking for in a sample equation:
>> A = 'A*exp(-((x-xc)/w)^2) \ b^2';
>> out = regexprep(A, '[*/\\\^]', '\.$0')
out =
A.*exp(-((x-xc)./w).^2) .\ b.^2
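For readers who want to try the same substitution outside MATLAB, a rough Python analogue is sketched below. It is not MATLAB code: re.sub stands in for regexprep, \g<0> plays the role of $0 (the entire match), and the function name add_dots is made up for illustration.

```python
import re


def add_dots(eq):
    """Put a '.' in front of every *, /, \\ or ^ in an equation string."""
    # Inside the character class, \\ is a literal backslash and ^ is literal
    # because it is not the first character of the class.
    return re.sub(r"[*/\\^]", r".\g<0>", eq)


print(add_dots("A*exp(-((x-xc)/w)^2)"))  # A.*exp(-((x-xc)./w).^2)
```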
I'd go with regexprep as in rayryeng's answer. But here's another approach, just to provide an alternative.
ops = '*/\^'; %// operators that need a dot
ii = find(ismember(Eq, ops)); %// find where dots should be inserted
[~, jj] = sort([1:numel(Eq) ii-.5]); %// will be used to properly order the result
result = [Eq repmat('.',1,numel(ii))]; %// insert dots at the end
result = result(jj); %// properly order the result
And a variant:
ops = '*/\^'; %// operators that need a dot
ii = find(ismember(Eq, ops)); %// find where dots should be inserted
jj = sort([1:numel(Eq) ii-.5]); %// dot locations are marked with fractional part
result = Eq(ceil(jj)); %// repeat characters where the dots will be placed
result(mod(jj,1)>0) = '.'; %// place dots at indices with fractional part
The vectorize function already does almost all of what you want except that it does not convert mldivide (\) to ldivide (.\).
By "efficient," do you mean fewer lines of code or faster? Regular expressions are almost always slower than other approaches and less readable. I don't think they're necessary or a good choice in this case. If you only need to convert your string once, then speed is less of a concern than readability (strrep will still be faster). If you need to do it many times, this simple code that you alluded to is 4–5 times faster than regexprep for short strings like your example (and much faster for longer strings):
out = strrep(Eq,'*','.*');
out = strrep(out,'/','./');
out = strrep(out,'\','.\');
out = strrep(out,'^','.^');
If you want one line, use:
out = strrep(strrep(strrep(strrep(Eq,'*','.*'),'/','./'),'\','.\'),'^','.^');
which will also be slightly faster still. Or create your own version of vectorize and call that.
Where regular expressions shine is in more complex cases, e.g., if your string is already partially vectorized: Eq = 'A.*exp(-((x-xc)/w)^2)'. Even still, the vectorize function just uses strrep and then calls strfind to "remove any possible '..*', '../', etc." and replace them with the proper element-wise operators because it's faster (symbolic math strings can get very large, for example).
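The partially vectorized case can be illustrated with a negative lookbehind. The sketch below is in Python rather than MATLAB, and the function name add_missing_dots is hypothetical; it only adds a dot when the operator is not already preceded by one.

```python
import re


def add_missing_dots(eq):
    """Add '.' before *, /, \\ or ^ unless a '.' is already there."""
    # (?<!\.) is a negative lookbehind: the previous character must not be a dot.
    return re.sub(r"(?<!\.)[*/\\^]", r".\g<0>", eq)


print(add_missing_dots("A.*exp(-((x-xc)/w)^2)"))  # A.*exp(-((x-xc)./w).^2)
```

Applying the function twice gives the same result as applying it once. Note that this simple check cannot tell an element-wise operator from a number that ends in a decimal point, so it should be treated as a starting point only.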