unable to match regex with text [duplicate]

https://regex101.com/r/sB9wW6/1
(?:(?<=\s)|^)#(\S+) <-- the problem in positive lookbehind
This works in production: (?:\s|^)#(\S+), but I need the correct start index (without the leading space).
Here it is in JS:
var regex = new RegExp(/(?:(?<=\s)|^)#(\S+)/g);
Error parsing regular expression: Invalid regular expression:
/(?:(?<=\s)|^)#(\S+)/
What am I doing wrong?
UPDATE
Ok, no lookbehind in JS :(
But anyways, I need a regex to get the proper start and end index of my match. Without leading space.

Make sure you always select the right regex engine at regex101.com. See an issue that occurred due to using a JS-only compatible regex with [^] construct in Python.
JS regex, at the time this question was answered, did not support lookbehinds. They have been increasingly adopted since their introduction in ECMAScript 2018. You do not really need one here, since you can use capturing groups:
var re = /(?:\s|^)#(\S+)/g;
var str = 's #vln1\n#vln2\n';
var res = [];
var m; // declare the match variable so it is not created as an implicit global
while ((m = re.exec(str)) !== null) {
res.push(m[1]);
}
console.log(res);
The (?:\s|^)#(\S+) matches a whitespace or the start of string with (?:\s|^), then matches #, and then matches and captures into Group 1 one or more non-whitespace chars with (\S+).
To get the start/end indices, use
var re = /(\s|^)#\S+/g;
var str = 's #vln1\n#vln2\n';
var pos = [];
var m; // declare the match variable so it is not created as an implicit global
while ((m = re.exec(str)) !== null) {
pos.push([m.index+m[1].length, m.index+m[0].length]);
}
console.log(pos);
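For engines that do support ES2018 lookbehind, the indices can be read directly from each match. The following is only a sketch under that assumption; hashtagSpans is a hypothetical helper name, and String.prototype.matchAll is assumed to be available (reasonably recent Node.js or browsers).

```javascript
// Sketch only: assumes an ES2018+ engine (lookbehind) and String.prototype.matchAll.
// hashtagSpans is a hypothetical helper name, not part of the original answer.
function hashtagSpans(text) {
  // (?<=\s|^) asserts a whitespace char or the start of string before "#"
  // without consuming it, so m.index already points at "#".
  const re = /(?<=\s|^)#(\S+)/g;
  const spans = [];
  for (const m of text.matchAll(re)) {
    spans.push({ tag: m[1], start: m.index, end: m.index + m[0].length });
  }
  return spans;
}

const spans = hashtagSpans('s #vln1\n#vln2\n');
console.log(spans);
// [ { tag: 'vln1', start: 2, end: 7 }, { tag: 'vln2', start: 8, end: 13 } ]
```

The start/end pairs are the same as those computed with the capturing-group approach, which remains the safer choice when older engines must be supported.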
BONUS
My regex works at regex101.com, but not in...
First of all, have you checked the Code Generator link in the Tools pane on the left?
All languages - "Literal string" vs. "String literal" alert - Make sure you test against the same text used in code, literal string, at the regex tester. A common scenario is copy/pasting a string literal value directly into the test string field, with all string escape sequences like \n (line feed char), \r (carriage return), \t (tab char). See Regex_search c++, for example. Mind that they must be replaced with their literal counterparts. So, if you have in Python text = "Text\n\n abc", you must use Text, two line breaks, abc in the regex tester text field. Text.*?abc will never match it although you might think it "works". Yes, . does not always match line break chars, see How do I match any character across multiple lines in a regular expression?
All languages - Backslash alert - Make sure you use backslashes correctly in your string literals. In most languages, regular string literals need a double backslash, i.e. \d used at regex101.com must be written as \\d. In raw string literals, use a single backslash, same as at regex101. Escaping the word boundary is very important since, in many languages (C#, Python, Java, JavaScript, Ruby, etc.), "\b" defines a BACKSPACE char, i.e. it is a valid string escape sequence. PHP does not support the \b string escape sequence, so "/\b/" = '/\b/' there.
All languages - Default flags - Global and Multiline - Note that by default the m and g flags are enabled at regex101.com. So, if you use ^ and $, they will match at the start and end of lines, respectively. If you need the same behavior in your code, check how multiline mode is implemented and either use a specific flag or, if supported, use an inline (?m) modifier. The g flag enables multiple occurrence matching; it is often implemented using specific functions/methods. Check your language reference to find the appropriate one.
line-breaks - Line endings at regex101.com are LF only, you can't test strings with CRLF endings, see regex101.com VS myserver - different results. Solutions can be different for each regex library: either use \R (PCRE, Java, Ruby) or some kind of \v (Boost, PCRE), \r?\n, (?:\r\n?|\n)/(?>\r\n?|\n) (good for .NET) or [\r\n]+ in other libraries (see answers for C#, PHP). Another issue related to the fact that you test your regex against a multiline string (not a list of standalone strings/lines) is that your patterns may consume the end of line, \n, char with negated character classes, see an issue like that. \D matched the end of line char, and in order to avoid it, [^\d\n] could be used, or other alternatives.
php - If you are dealing with Unicode strings, or want shorthand character classes to match Unicode characters, too (e.g. \w+ to match Стрибижев or Stribiżew, or \s+ to match hard spaces), you need to use the u modifier, see preg_match() returns 0 although regex testers work
- To match all occurrences, use preg_match_all, not preg_match with /...pattern.../g, see PHP preg_match to find multiple occurrences and "Unknown modifier 'g' in..." when using preg_match in PHP?
- Your regex with an inline backreference like \1 refuses to work? Are you using a double-quoted string literal? Use a single-quoted one, see Backreference does not work in PHP
php, laravel - Mind that you need the regex delimiters around the pattern, see https://stackoverflow.com/questions/22430529
python - Note that re.search, re.match, re.fullmatch, re.findall and re.finditer accept the regex as the first argument, and the input string as the second argument. Not re.findall("test 200 300", r"\d+"), but re.findall(r"\d+", "test 200 300"). If you test at regex101.com, please check the "Code Generator" page.
- You used re.match that only searches for a match at the start of the string, use re.search: Regex works fine on Pythex, but not in Python
- If the regex contains capturing group(s), re.findall returns a list of captures/capture tuples. Either use non-capturing groups, or re.finditer, or remove redundant capturing groups, see re.findall behaves weird
- If you used ^ in the pattern to denote start of a line, not start of the whole string, or used $ to denote the end of a line and not a string, pass re.M or re.MULTILINE flag to re method, see Using ^ to match beginning of line in Python regex
- If you try to match some text across multiple lines, and use re.DOTALL or re.S, or [\s\S]* / [\s\S]*?, and still nothing works, check if you read the file line by line, say, with for line in file:. You must pass the whole file contents as the input to the regex method, see Getting Everything Between Two Characters Across New Lines.
- Having trouble adding flags to regex and trying something like pattern = r"/abc/gi"? See How to add modifers to regex in python?
c#, .net - .NET regex does not support possessive quantifiers like ++, *+, ?+, {1,10}+, see .NET regex matching digits between optional text with possessive quantifer is not working
- When you match against a multiline string and use the RegexOptions.Multiline option (or the inline (?m) modifier) with a $ anchor in the pattern to match entire lines, and get no match in code, you need to add \r? before $, see .Net regex matching $ with the end of the string and not of line, even with multiline enabled
- To get multiple matches, use Regex.Matches, not Regex.Match, see RegEx Match multiple times in string
- Similar case as above: splitting a string into paragraphs, by a double line break sequence - C# / Regex Pattern works in online testing, but not at runtime
- You should remove regex delimiters, i.e. @"/\d+/" must actually look like @"\d+", see Simple and tested online regex containing regex delimiters does not work in C# code
- If you unnecessarily used Regex.Escape to escape all characters in a regular expression (like Regex.Escape(@"\d+\.\d+")), you need to remove Regex.Escape, see Regular Expression working in regex tester, but not in c#
dart, flutter - Use raw string literal, RegExp(r"\d"), or double backslashes (RegExp("\\d")) - https://stackoverflow.com/questions/59085824
javascript - Double escape backslashes in a RegExp("\\d"): Why do regex constructors need to be double escaped?
- (Negative) lookbehinds unsupported by most browsers: Regex works on browser but not in Node.js
- Strings are immutable, assign the .replace result to a var - The .replace() method does change the string in place
- Retrieve all matches with str.match(/pat/g) - Regex101 and Js regex search showing different results or, with RegExp#exec, RegEx to extract all matches from string using RegExp.exec
- Replace all pattern matches in string: Why does javascript replace only first instance when using replace?
javascript, angular - Double the backslashes if you define a regex with a string literal, or just use a regex literal notation, see https://stackoverflow.com/questions/56097782
java - Word boundary not working? Make sure you use double backslashes, "\\b", see Regex \b word boundary not works
- Getting invalid escape sequence exception? Same thing, double backslashes - Java doesn't work with regex \s, says: invalid escape sequence
- No match found is bugging you? Run Matcher.find() / Matcher.matches() - Why does my regex work on RegexPlanet and regex101 but not in my code?
- .matches() requires a full string match, use .find(): Java Regex pattern that matches in any online tester but doesn't in Eclipse
- Access groups using matcher.group(x): Regex not working in Java while working otherwise
- Inside a character class, both [ and ] must be escaped - Using square brackets inside character class in Java regex
- You should not run matcher.matches() and matcher.find() consecutively, use only if (matcher.matches()) {...} to check if the pattern matches the whole string and then act accordingly, or use if (matcher.find()) to check if there is a single match or while (matcher.find()) to find multiple matches (or Matcher#results()). See Why does my regex work on RegexPlanet and regex101 but not in my code?
scala - Your regex attempts to match several lines, but you read the file line by line (e.g. use for (line <- fSource.getLines))? Read it into a single variable (see matching new line in Scala regex, when reading from file)
kotlin - You have Regex("/^\\d+$/")? Remove the outer slashes, they are regex delimiter chars that are not part of a pattern. See Find one or more word in string using Regex in Kotlin - You expect a partial string match, but .matchEntire requires a full string match? Use .find, see Regex doesn't match in Kotlin
mongodb - Do not enclose /.../ with single/double quotation marks, see mongodb regex doesn't work
c++ - regex_match requires a full string match, use regex_search to find a partial match - Regex not working as expected with C++ regex_match
- regex_search finds the first match only. Use sregex_token_iterator or sregex_iterator to get all matches: see What does std::match_results::size return?
- When you read a user-defined string using std::string input; std::cin >> input;, note that cin will only get to the first whitespace, to read the whole line properly, use std::getline(std::cin, input); - C++ Regex to match '+' quantifier
- "\d" does not work, you need to use "\\d" or R"(\d)" (a raw string literal) - This regex doesn't work in c++
- Make sure the regex is tested against a literal text, not a string literal, see Regex_search c++
go - Double backslashes or use a raw string literal: Regular expression doesn't work in Go - Go regex does not support lookarounds, select the right option (Go) at regex101.com before testing! Regex expression negated set not working golang
groovy - Return all matches: Regex that works on regex101 does not work in Groovy
r - Double escape backslashes in the string literal: "'\w' is an unrecognized escape" in grep
- Use perl=TRUE to enable the PCRE engine ((g)sub/(g)regexpr): Why is this regex using lookbehinds invalid in R?
oracle - Greediness of all quantifiers is set by the first quantifier in the regex, see Regex101 vs Oracle Regex (then, you need to make all the quantifiers as greedy as the first one)
- \b does not work? Oracle regex does not support word boundaries at all, use workarounds as shown in Regex matching works on regex tester but not in oracle
firebase - Double escape backslashes, make sure ^ only appears at the start of the pattern and $ is located only at the end (if any), and note you cannot use more than 9 inline backreferences: Firebase Rules Regex Birthday
firebase, google-cloud-firestore - In Firestore security rules, the regular expression needs to be passed as a string, which also means it shouldn't be wrapped in / symbols, i.e. use allow create: if docId.matches("^\\d+$").... See https://stackoverflow.com/questions/63243300
google-data-studio - /pattern/g in REGEXP_REPLACE must contain no / regex delimiters and flags (like g) - see How to use Regex to replace square brackets from date field in Google Data Studio?
google-sheets - If you think REGEXEXTRACT does not return full matches, or truncates the results, you should check if you have redundant capturing groups in your regex and remove them, or convert the capturing groups to non-capturing ones by adding ?: after the opening (, see Extract url domain root in Google Sheet
sed - Why does my regular expression work in X but not in Y?
word-boundary, pcre, php - [[:<:]] and [[:>:]] do not work in the regex tester, although they are valid constructs in PCRE, see https://stackoverflow.com/questions/48670105
snowflake-cloud-data-platform, snowflake-sql - If you are writing a stored procedure, and \\d does not work, you need to double them again and use \\\\d, see REGEX conversion of VARCHAR value to DATE in Snowflake stored procedure using RLIKE not consistent.
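Two of the "All languages" pitfalls above (the backslash alert and the default m flag at regex101.com) can be reproduced in a few lines of JavaScript. This is only an illustrative sketch; the sample strings and variable names are made up for the example.

```javascript
// Backslash alert: in a regular string literal, "\b" is a backspace char (U+0008),
// so the first pattern looks for backspace + "cat" + backspace.
const fromBadString = new RegExp("\bcat\b");
const fromGoodString = new RegExp("\\bcat\\b"); // doubled backslashes reach the regex engine as \b
const fromLiteral = /\bcat\b/;                  // regex literal: single backslashes, as at regex101

const sample = "a cat sat";
const backslashResults = [
  fromBadString.test(sample),
  fromGoodString.test(sample),
  fromLiteral.test(sample),
];
console.log(backslashResults); // [ false, true, true ]

// Default flags: without the m flag, ^ only matches at the start of the whole string.
const lines = "one\ntwo\nthree";
const withoutM = lines.match(/^\w+/g);
const withM = lines.match(/^\w+/gm);
console.log(withoutM); // [ 'one' ]
console.log(withM);    // [ 'one', 'two', 'three' ]
```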

Related

Lua pattern matching: When can anchors be safely omitted?

The reference manual describes pattern & anchors as follows:
A pattern is a sequence of pattern items. A '^' at the beginning of a pattern anchors the match at the beginning of the subject string. A '$' at the end of a pattern anchors the match at the end of the subject string. At other positions, '^' and '$' have no special meaning and represent themselves.
Clearly, if a pattern ends with .* or .+ (no matter whether inside a capture group), a trailing $ anchor may be safely omitted, as the entire remaining sequence will be matched either way by the last greedy quantifier; for .-, the anchor may not be omitted though, as that wouldn't force it to match all characters to the end.
But now for the "beginning" of string anchor: it seems the same holds, as ^.* and ^.+ can simply be converted into .* and .+ respectively. However, surprisingly, it seems that this time (perhaps due to the way patterns are implemented) ^.- can indeed be simplified to .-, at least from my testing. Even though the docs state:
a single character class followed by '-', which also matches 0 or more repetitions of characters in the class. Unlike '*', these repetition items will always match the shortest possible sequence;
If it isn't anchored, the pattern matching could start at a later position, thus matching a shorter sequence for .- - yet this isn't happening:
$ lua
Lua 5.3.4 Copyright (C) 1994-2017 Lua.org, PUC-Rio
> ("00000000000000000000000001"):match".-1"
00000000000000000000000001
> ("00000000000000000000000001"):match"^.-1"
00000000000000000000000001
>
Is this somehow guaranteed or specified behavior, or is it just "undefined" behavior and should the anchor ^ still be used to stay on the safe side should the implementation change?

There are two things you need to bear in mind when using Lua patterns (and any patterns in general):
- There are pattern strings that are used to match specific texts.
- There are libraries, methods or functions in programming languages that parse the pattern strings and extract/replace/remove/split the input strings based on the incoming pattern logic.
Thus, please make sure you understand what your pattern does and how a specific function/method uses the pattern.
If you use match and ^.-1, the result will be a substring that matches at the start of string (^), then contains zero or more chars, as few as possible, up to the leftmost occurrence of 1. The ^ is a pattern part that guarantees that matching starts only at the start of string. However, match only searches for a single match (it is not gmatch), an unanchored search still tries the start of the string first, and . in Lua patterns matches any char (including line break chars), so that first attempt succeeds whenever there is a 1 anywhere in the string. Thus, .-1 with match will yield the same match.
Once you use gmatch to find multiple matches, the ^.-1 and .-1 patterns will start to make a difference.
If you use it in a replacing/removing context, the difference will be visible at once, too, since by default, these methods - and string.gsub is not an exception - replace all found matches: "Its basic use is to substitute the replacement string for all occurrences of the pattern inside the subject string" (see 20.1 – Pattern-Matching Functions).
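The same distinction can be sketched with a JavaScript analogue (this is not Lua code; [\s\S]*? stands in for Lua's .- because . in JavaScript does not match line breaks by default). A single match gives the same result with or without the anchor, while global matching and replacing do not.

```javascript
// JavaScript analogue of the Lua patterns .-1 and ^.-1 (illustration only).
const subject = "0001001";

// Single match: the engine tries position 0 first, so the anchor changes nothing.
const singleUnanchored = subject.match(/[\s\S]*?1/)[0];
const singleAnchored = subject.match(/^[\s\S]*?1/)[0];
console.log(singleUnanchored, singleAnchored); // 0001 0001

// Multiple matches: only the unanchored pattern can match again later in the string.
const allUnanchored = subject.match(/[\s\S]*?1/g);
const allAnchored = subject.match(/^[\s\S]*?1/g);
console.log(allUnanchored); // [ '0001', '001' ]
console.log(allAnchored);   // [ '0001' ]

// Replacing all occurrences shows the same difference.
const replacedUnanchored = subject.replace(/[\s\S]*?1/g, "X");
const replacedAnchored = subject.replace(/^[\s\S]*?1/g, "X");
console.log(replacedUnanchored, replacedAnchored); // XX X001
```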

Regex to select everything but a pattern of 13 digits in a row

I have a String such as
https://www.mywebsite.com/123_05547898_8101060027367_00.jpeg
, and using Regex & NodeJs, I need to select everything except a pattern of 13 digits in a row (i.e. without other char types between digits).
Thus, I'm expecting to select:
https://www.mywebsite.com/123_05547898__00.jpeg
In other words, I would need the opposite of
\d{13}
Anyone got an idea?
Thanks for your help.

You can use
text.replace(/\d{13}|(.)/g, '$1')
text.replace(/(_)\d{13}(?=_)|(.)/g, '$1$2') // only in between _s
See the regex demo.
The \d{13}|(.) pattern matches thirteen digits or any one char other than line break chars (LF and CR) while capturing it into Group 1. To put back this char, the $1 backreference is used in the replacement pattern.
Note there is no regex construct like "match some text other than a sequence of more than one character" (it is only supported in Lucene regex flavor that is rather a specific regex flavor). There is no way to emulate such a construct in JavaScript (it is possible in PCRE where you can use an alternation with (*SKIP)(*FAIL) and a tempered greedy token).
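As a quick check of the first replacement above, here is a small runnable sketch using the URL from the question. removeThirteenDigitRuns is a hypothetical helper name chosen for the example.

```javascript
// Sketch: keep every char except runs of exactly thirteen matched digits.
// removeThirteenDigitRuns is a hypothetical name, not from the original answer.
function removeThirteenDigitRuns(text) {
  // When \d{13} matches, Group 1 did not participate, so '$1' inserts an empty string.
  // When (.) matches, '$1' puts the captured char back unchanged.
  return text.replace(/\d{13}|(.)/g, '$1');
}

const url = 'https://www.mywebsite.com/123_05547898_8101060027367_00.jpeg';
const cleaned = removeThirteenDigitRuns(url);
console.log(cleaned); // https://www.mywebsite.com/123_05547898__00.jpeg
```

Note that \d{13} also matches the first thirteen digits of a longer digit run; whether that is acceptable depends on the data, which the question does not specify.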

How to search and replace using regular expressions in Visual Studio

I need to replace all the urls with empty string:
""regular"": ""http://fonts.gstatic.com/s/abhayalibre/v3/zTLc5Jxv6yvb1nHyqBasVy3USBnSvpkopQaUR-2r7iU.ttf"",
""500"": ""http://fonts.gstatic.com/s/abhayalibre/v3/wBjdF6T34NCo7wQYXgzrc5MQuUSAwdHsY8ov_6tk1oA.ttf"",
""600"": ""http://fonts.gstatic.com/s/abhayalibre/v3/wBjdF6T34NCo7wQYXgzrc2v8CylhIUtwUiYO7Z2wXbE.ttf"",
""700"": ""http://fonts.gstatic.com/s/abhayalibre/v3/wBjdF6T34NCo7wQYXgzrc0D2ttfZwueP-QU272T9-k4.ttf"",
""800"": ""http://fonts.gstatic.com/s/abhayalibre/v3/wBjdF6T34NCo7wQYXgzrc_qsay_1ZmRGmC8pVRdIfAg.ttf""
I've tried using the Regular Expressions with:
"http://fonts(*).ttf"
but I can't see the replace working.

Your mistake is (*), use instead:
http://fonts.+\.ttf

Regular Expression Search and Replace is actually quite well documented.
At the moment you're matching strings that look like this, unless Visual Studio actually fails to parse the expression because of the incorrect usage of *.
http://fonts).ttf
http://fonts().ttf
http://fonts(().ttf
http://fonts(((().ttf
http://fonts((((((((((((((((((((((((((((((().ttf
etc
To match any character you could use .*, . being the universal match in Regex, but that will match beyond the closing quotes.
Instead, you can use [^"]+ to match one or more characters except ".
http://fonts\.[^"]+
Also, note the \. to make sure the regex actually matches the . character, the \ escapes it from being the universal match character.
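The difference between a greedy .+ and the [^"]+ approach is easiest to see when two URLs share one line. The sketch below uses JavaScript rather than the Visual Studio dialog, and the fonts.example.com host is a made-up placeholder.

```javascript
// Illustration only: one line holding two quoted URLs (hypothetical host).
const line = '"a": "http://fonts.example.com/a.ttf", "b": "http://fonts.example.com/b.ttf"';

// Greedy .+ runs from the first URL through to the last ".ttf" on the line.
const greedy = line.replace(/http:\/\/fonts.+\.ttf/g, '');
console.log(greedy); // "a": ""

// [^"]+ stops at the closing quote, so each URL is removed separately.
const bounded = line.replace(/http:\/\/fonts\.[^"]+/g, '');
console.log(bounded); // "a": "", "b": ""
```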

Regex Search/Replace is... inverted

Up to a few days ago my Sublime text 3 was working just fine. I could search/replace regular strings and use regular expressions patterns as well and when a capture group got a match, all of them were highlighted perfectly.
However, since yesterday, everything I search is matching... reversely. Here:
image:\s*"?(.*?)"?
This should match a fixed string image, followed by a colon, any number of spaces, if any, and anything between optional quotes.
Not a big deal, right? However Sublime is capturing the string image instead of what I've defined to be captured. Even if there are no spaces or quotes, it should at least match what's after the colon, not before it:
I did a fresh install, reinstalling and reconfiguring the very few plugins I use, trying to, maybe, get rid of any sort of caching, without luck.
And this is a major setback for me 'cause I can't do batch replacements all over a project.
There are only two things I did differently than my regular development routine:
Installed String 2 Lower Hyphen Plugin to speed-up the creation of some dashed separated URI slugs BUT when fresh installing I didn't add it back and the problem persisted.
For the first time, I used the expression <open files> to do a batch replacement in a specific set of files I had manually opened since they're in different directories.
Nothing more than that.
I can work around the issue by changing the .*? to a .*, but this is only a palliative measure since I have always used the non-greedy version without problems.
Does anyone know what could be happening?

I'm not sure how your regex used to match any differently, let's think about what the regex is saying:
image: - the literal image:
\s* - any amount of whitespace, including none
"? - an optional quote
(.*?) - lazily capture anything except a newline character into capture group 1
"? - an optional quote
So for your example text to match, it matches image: and the space after it, then, there is no quote, the next instruction is lazy so it captures nothing into capture group 1, then there is no quote, so that is the full extent of the match.
If you always want to capture the value in capture group 1, regardless of whether it was a quoted or unquoted string, you could instead consider using an expression like:
\bimage:\s*"?((?(?<=")[^"]*|.*$))"?
\b word boundary, so that image: is matched only at the start of a word
image: literal image:
\s* any amount of whitespace, including none (depending on your source document and requirements, it may be better/more defensive to specify a literal space so that newlines won't be matched here)
"? optional quote
( begin capture group 1
(?(?<=") conditional - lookbehind to see if a quote matched
[^"]* if a quote matched, then match all non-quote characters (of course, we could also check for escaped quotes if your file is YAML format or similar, but URLs shouldn't contain quotes, so we're leaving it out as per the original regex.)
| otherwise, if the conditional didn't match, i.e. there was no quote after image:
.*$ match everything until the newline - again, if this is YAML, you may want to consider excluding comments etc.
) end conditional
) end capture group
"? optional quote (this will never match at $ if the conditional fails)

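The empty capture described in this answer can be reproduced outside Sublime Text. The sketch below uses JavaScript, which has no (?(condition)yes|no) conditional syntax, so the second pattern swaps in a plain alternation with two capture groups instead of the conditional shown above. imageValue and the sample lines are hypothetical.

```javascript
const quoted = 'image: "foo.png"';
const unquoted = 'image: foo.png';

// Original pattern: the lazy group is satisfied with zero chars,
// and the trailing optional quote is allowed to match nothing.
const lazy = /image:\s*"?(.*?)"?/;
const lazyMatch = quoted.match(lazy);
console.log(JSON.stringify(lazyMatch[0]), JSON.stringify(lazyMatch[1])); // "image: \"" ""

// Swapped-in technique (not the conditional from the answer): alternation
// with one group for the quoted value and one for the unquoted value.
const explicit = /\bimage:\s*(?:"([^"]*)"|(.*))/;

// imageValue is a hypothetical helper name.
function imageValue(line) {
  const m = line.match(explicit);
  if (m === null) return null;
  return m[1] !== undefined ? m[1] : m[2];
}

console.log(imageValue(quoted));   // foo.png
console.log(imageValue(unquoted)); // foo.png
```
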
Using matched patterns in replace

I don't see the Sublime regex documentation saying anything about how to use matched patterns in the replace function. I tried to use the PHP/htaccess format of $0 (and $1 just in case the indexes start with 1), but no luck.
What I'm trying to do, is go through all my methods, and make static methods begin with an uppercase letter. So I would like to change all calls to Foo::bar() (PHP syntax) into Foo::Bar(). So even if I knew how to use the matched pattern (in this case b), is there a way to make it uppercase in the replace field?

These operators are described in the Boost regex library reference:
\u Causes the next character to be outputted, to be output in upper case.
So, you may use \u uppercase operator in the replacement pattern to make the first character after it uppercase.
Search: ::(\w+\(\))
Replace: ::\u$1
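For comparison outside Sublime Text: JavaScript replacement strings have no \u operator, so a similar batch change would need a replacement callback. This is only a sketch with the same search pattern; capitalizeStaticCalls and the sample source string are made up for the example.

```javascript
// Sketch: uppercase the first letter of each static method call.
// capitalizeStaticCalls is a hypothetical helper name.
function capitalizeStaticCalls(source) {
  // Same search pattern as above; the callback does the work of \u$1.
  return source.replace(/::(\w+\(\))/g, (match, call) =>
    '::' + call[0].toUpperCase() + call.slice(1));
}

const converted = capitalizeStaticCalls('Foo::bar(); Baz::qux();');
console.log(converted); // Foo::Bar(); Baz::Qux();
```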
