Custom Syntax Highlighting in Sublime Text 3 -- inline comments indicated by $ - sublimetext3

I'm trying to write a custom syntax file for a code that uses 'C' or 'c' to indicate lines that are comments and '$' to indicate inline comments. Right now I have:
comments:
  # Comments begin with a 'c' or C and finish at the end of the line.
  - match: '\b(c|C|\\$)\b'
    scope: punctuation.definition.comment.mcinp
    push:
      # This is an anonymous context push for brevity.
      - meta_scope: comment.line.c.mcinp
      - match: $\n?
        pop: true
So:
A line that starts with an upper or lower case 'C' is a comment. Anything after a '$' in a line is a comment:
c this line is a comment
a = 1 $ anything on a line after a dollar sign is a comment
This doesn't change the highlighting of text after a $, so it must be wrong. I'd appreciate any insight on this.

I think your problem here is two-fold.
First, the construct \b matches a word boundary, but the boundary of a word is defined as the following (from this page):
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Neither whitespace nor $ is a word character. For \b to match on both sides of the $, there must therefore be a word character immediately before it and another immediately after it, which is not the case in a line like a = 1 $ comment.
The second issue is that \\$ isn't creating an escaped $ like you think it is. It is the escape character escaping itself (\\), which matches a literal backslash, followed by an unescaped $, which anchors to the end of the line. As such, that alternative can never match, because the trailing \b would require the character after the end of the line to be a word character, which it can't be.
What you probably want instead is \$, which matches a literal $ character.
All combined, the example would look more like this:
  # Comments begin with a 'c' or C and finish at the end of the line.
  - match: '(?:\b[Cc]\b)|\$'
    scope: punctuation.definition.comment.mcinp
    push:
      # This is an anonymous context push for brevity.
      - meta_scope: comment.line.c.mcinp
      - match: $\n?
        pop: true
This moves the $ out of the bounds of the word boundary conditions you've defined, so it will match as appropriate.
As a side note, your question mentions that a line that starts with a C is a comment, but as defined a C anywhere in the line defines a comment, as long as it's a single word.
To get it to behave as your question describes, something like the following is more appropriate, which constrains the match on the C characters to being the first non-whitespace character on the line:
- match: '(?:^\s*[Cc]\b)|\$'
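The two points above can be checked outside Sublime Text. The following is a minimal sketch using Python's re module, on the assumption that it treats \b, \\$ and \$ the same way Sublime's regex engine does for these simple cases. The sample lines come from the question; the variable names are only illustrative.

```python
import re

# Assumption: Python's re stands in for Sublime's regex engine here;
# \b, \\$ and \$ are expected to behave the same way for these simple cases.
original = re.compile(r"\b(c|C|\\$)\b")    # pattern from the question
fixed = re.compile(r"(?:^\s*[Cc]\b)|\$")   # pattern from the answer

full_line = "c this line is a comment"
inline = "a = 1 $ anything on a line after a dollar sign is a comment"

# The original pattern still finds a standalone 'c' ...
print(original.search(full_line))
# ... but never the '$': \\$ means "a backslash, then end of line".
print(original.search(inline))

# The fixed pattern finds both comment markers.
print(fixed.search(full_line).group())
print(fixed.search(inline).group(), "at index", fixed.search(inline).start())
```

The second print shows None, which is why the text after the $ was never highlighted.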

Related

Regular expression (^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$) meaning

Given a tweet dataset from this link which has a content column as follows:
I hope to add one new column to identify whether or not the tweet mentioned Trump. The regex pattern (^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$) seems to work, but I don't understand it well. I've tested it with the code below:
Test1 gives the output since it's matched:
import re
txt1 = "anti-Trump protesters"
re.search("(^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$)", txt1)
Out:
<_sre.SRE_Match object; span=(4, 11), match='-Trump '>
Test2 returns None since, as expected, it doesn't match:
txt2 = 'I got Trumped'
re.search("(^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$)", txt2)
Could someone explain this pattern a little? Many thanks in advance.
The (^|[^A-Za-z0-9]) portion has |, which means “or”. The left side, the ^, is the start of the string. The right side, [^A-Za-z0-9], matches any character that is not a letter or a number. In short, it matches when “Trump” is at the start of the string, or is preceded by a non-alphanumeric character.
The ([^A-Za-z0-9]|$) follows a similar pattern, where the left side matches any character that is not a letter or a number. The right side, the $ matches the end of the string. Likewise, it matches when “Trump” is at the end of the string or is followed by a non-alphanumeric character.
So, bottom line: it matches “Trump” when it is either at the start of the string or preceded by a non-alphanumeric character, and is also either at the end of the string or followed by a non-alphanumeric character.
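As a small illustration of the explanation above, here is a sketch that wraps the pattern in a helper returning True or False. The function name mentions_trump and the two extra sample strings are assumptions made for this example, not part of the question.

```python
import re

# Pattern from the question: "Trump" bounded by the string edges
# or by any character that is not a letter or a digit.
TRUMP_RE = re.compile(r"(^|[^A-Za-z0-9])Trump([^A-Za-z0-9]|$)")

def mentions_trump(text):
    """Return True if text contains 'Trump' as a standalone token (hypothetical helper)."""
    return TRUMP_RE.search(text) is not None

for sample in ["anti-Trump protesters", "I got Trumped", "Trump", "Trump's rally"]:
    print(sample, "->", mentions_trump(sample))
```

Only "I got Trumped" is rejected, because the "e" after "Trump" is a letter and the string does not end there.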

Regular expression to extract text between multiple hyphens

I have a string in the following format
----------some text-------------
How do I extract "some text" without the hyphens?
I have tried with this but it matches the whole string
(?<=-).*(?=-)
The pattern matches the whole line except the first and last hyphen: the assertions on the left and right each require only a single hyphen, and the . also matches hyphens.
You can keep using the assertions, and match any char except a hyphen using [^-]+ which is a negated character class.
(?<=-)[^-]+(?=-)
See a regex demo.
Note: if you also want to prevent matching a newline you can use (?<=-)[^-\r\n]+(?=-)
With your shown samples please try following regex.
^-+([^-]*)-+$
Online regex demo
Explanation: match one or more dashes from the start of the string, then capture everything up to the next - in the one and only capturing group, then match the remaining dashes up to the end of the string.
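The difference between the original attempt and the two suggested patterns can be seen side by side. This is a minimal sketch in Python; the variable names are only illustrative.

```python
import re

s = "----------some text-------------"

# Original attempt: '.' also matches '-', so only the outermost hyphens are left out.
greedy = re.search(r"(?<=-).*(?=-)", s).group()

# Negated character class: the match cannot contain any hyphen.
lookaround = re.search(r"(?<=-)[^-]+(?=-)", s).group()

# Anchored variant: the text is read from capture group 1.
anchored = re.search(r"^-+([^-]*)-+$", s).group(1)

print(greedy == s[1:-1])
print(lookaround)
print(anchored)
```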
You are using Python, right? Then you do not need regex.
Use
s = "----------some text-------------"
print(s.strip("-"))
Results: some text.
See Python proof.
With regex, the same can be achieved using
re.sub(r'^-+|-+$', '', s)
See regex proof.
EXPLANATION
^    the beginning of the string
-+   '-' (1 or more times, matching the most amount possible)
|    OR
-+   '-' (1 or more times, matching the most amount possible)
$    before an optional \n, and the end of the string

Regex to find compensations in text

I need to find mentions of compensations in emails. I am new to regex. Please see below the approach I am using.
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
My python code to find this,
import re
PATTERN = r'((\$|\£) [0-9]*)|((\$|\£)[0-9]*)'
print(re.findall(PATTERN,sample_text))
The matches I am getting is
[('', '', '$115', '$'), ('', '', '$55', '$'), ('', '', '$60', '$')]
Expected match
["$115k/yr","$55/hr","$60/hr"]
Also the $ sign can be written as USD. How do I handle this in the same regex.
You can use
[$£]\d+[^.\s]*
[$£] Match either $ or £
\d+ Match 1+ digits
[^.\s]* Repeat 0+ times matching any char except . or a whitespace
Regex demo
import re
sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"
PATTERN = r'[$£]\d+[^.\s]*'
print(re.findall(PATTERN,sample_text))
Output
['$115k/yr', '$55/hr', '$60/hr']
If there must be a / present, you might also use
[$£]\d+[^\s/.]*/\w+
Regex demo
You can have something like:
[$£]\d+[^.]+
>>> PATTERN = r'[$£]\d+[^.]+'
>>> print(re.findall(PATTERN,sample_text))
['$115k/yr', '$55/hr', '$60/hr']
[$£] matches "$" or a "£"
\d+ matches one or more digits
[^.]+ matches one or more characters that are not a "."
The parentheses in your regex cause the engine to report the contents of each parenthesized subexpression. You can use non-grouping parentheses (?:...) to avoid this, but of course, your expression can be rephrased to not have any parentheses at all:
PATTERN = r'[$£]\s*\d+'
Notice also how I changed the last quantifier to a + -- your attempt would also find isolated currency symbols with no numbers after them.
To point out the hopefully obvious, \s matches whitespace and \s* matches an arbitrary run of whitespace, including none at all; and \d matches a digit.
If you want to allow some text after the extracted match, add something like (?:/\w+)? to allow for a slash and one single word token as an optional expression after the main match. (Maybe adorn that with \s* on both sides of the slash, too.)
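To see what the optional suffix changes, here is a short sketch that runs both variants on the sample text from the question. The constant names are illustrative only.

```python
import re

sample_text = "Rate – $115k/yr. or $55/hr. - $60/hr"

# Amount only: currency symbol, optional whitespace, digits.
AMOUNT = r"[$£]\s*\d+"
# Same, plus an optional slash followed by one word token.
AMOUNT_WITH_UNIT = r"[$£]\s*\d+(?:/\w+)?"

print(re.findall(AMOUNT, sample_text))
print(re.findall(AMOUNT_WITH_UNIT, sample_text))
```

Because the group is non-capturing, findall returns whole matches rather than tuples. Note that "$115k/yr" still yields only "$115" with the second pattern: the "k" sits between the digits and the slash, so a suffix like that would need its own allowance in the pattern.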

Why doesn't this RegEx match anything?

I've been trying for about two hours now to write a regular expression which matches a single character that's not preceded or followed by the same character.
This is what I've got: (\d)(?<!\1)\1(?!\1); but it doesn't seem to work! (testing at https://regex101.com/r/whnj5M/6)
For example:
In 1111223 I would expect to match the 3 at the end, since it's not preceded or followed by another 3.
In 1151223 I would expect to match the 5 in the middle, and the 3 at the end for the same reasons as above.
The end goal for this is to be able to find pairs (and only pairs) of characters in strings (e.g. to find 11 in 112223 or 44 in 123544) and I was going to try and match single isolated characters, and then add a {2} to it to find pairs, but I can't even seem to get isolated characters to match!
Any help would be much appreciated, I thought I knew RegEx pretty well!
P.S. I'm testing in JS on regex101.com because it wouldn't let me use variable length lookbacks in Python on there, and I'm using the regex library to allow for this in my actual implementation.
Your regex is close, but by using simply (\d) you are consuming characters, which prevents the other match from occurring. Instead, you can use a positive lookahead to set the capture group and then test for any occurrences of the captured digit not being surrounded by copies of itself:
(?=.*?(.))(?<!\1)\1(?!\1)
By using a lookahead you avoid consuming any characters and so the regex can match anywhere in the string.
Note that in 1151223 this returns 5, 1 and 3 because the third 1 is not adjacent to any other 1s.
Demo on regex101 (requires JS that supports variable width lookbehinds)
The pattern you tried does not match because this part (\d)(?<!\1) can not match.
It reads as:
Capture a digit in group 1. Then, on the position after that captured
digit, assert what is captured should not be on the left.
You could make the pattern work by adding for example a dot after the backreference (?<!\1.) to assert that the value before what you have just matched is not the same as group 1
Pattern
(\d)(?<!\1.)\1(?!\1)
Regex demo | Python demo
Note that you have selected ECMAscript on regex101.
Python re does not support variable width lookbehind.
To make this work in Python, you need the PyPi regex module.
Example code
import regex
pattern = r"(\d)(?<!\1.)\1(?!\1)"
test_str = ("1111223\n"
            "1151223\n\n"
            "112223\n"
            "123544")
matches = regex.finditer(pattern, test_str)
for matchNum, match in enumerate(matches, start=1):
    print(match.group())
Output
22
11
22
11
44
@Theforthbird has provided a good explanation for why your regular expression does not match the characters of interest.
Each character matched by the following regular expression is neither preceded nor followed by the same character (including characters at the beginning and end of the string).
r'^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)'
Demo
Python's re regex engine performs the following operations.
^.$        match the first char if it is the only char in the line
|          or
^          match beginning of line
(.)        match a char in capture group 1...
(?!\1)     ...that is not followed by the same character
|          or
(?<=(.))   save the previous char in capture group 2...
(?!\2)     ...that is not equal to the next char
(.)        match a character and save to capture group 3...
(?!\3)     ...that is not equal to the following char
Suppose the string were "cat".
The internal string pointer is initially at the beginning of the line.
"c" is not at the end of the line so the first part of the alternation fails and the second part is considered.
"c" is matched and saved to capture group 1.
The negative lookahead asserting that "c" is not followed by the content of capture group 1 succeeds, so "c" is matched and the internal string pointer is advanced to a position between "c" and "a".
"a" fails the first two parts of the alternation so the third part is considered.
The positive lookbehind (?<=(.)) saves the preceding character ("c") in capture group 2.
The negative lookahead (?!\2), which asserts that the next character ("a") is not equal to the content of capture group 2, succeeds. The string pointer remains just before "a".
The next character ("a") is matched and saved in capture group 3.
The negative lookahead (?!\3), which asserts that the following character ("t") does not equal the content of capture group 3, succeeds, so "a" is matched and the string pointer advances to just before "t".
The same steps are performed when evaluating "t" as were performed when evaluating "a". Here the last token ((?!\3)) succeeds, however, because no characters follow "t".
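The walkthrough above can be reproduced with a few lines of Python. This is a minimal sketch using the standard re module; the helper name isolated_chars is an assumption made for this example.

```python
import re

# Pattern from the answer; groups 1-3 exist only to feed the backreferences.
ISOLATED = re.compile(r"^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)")

def isolated_chars(text):
    """Return every character that has no identical neighbour (hypothetical helper)."""
    # finditer + group(0) yields whole matches; findall would return tuples of groups.
    return [m.group(0) for m in ISOLATED.finditer(text)]

for sample in ["cat", "1111223", "1151223"]:
    print(sample, isolated_chars(sample))
```

For "1151223" the result contains "5", "1" and "3": the third 1 counts as isolated because neither of its neighbours is a 1.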

A complicated case of conditional line splitting to be performed in Vim

Here is a sample of text that I’m working with:
Word1
Word2
...
Word4 / Word5 Word6
Word7
Word8 Word9 Word10 / Word11 Word12 Word13 Word14
Word15
Word16
...
I would like to transform it by splitting the lines containing
slash-separated chunks, so that the first chunk (preceding the slash)
gets the trailing words copied from the second chunk (following the
slash) to equalize the number of words in both lines resulting from
the chunks, if the former one has fewer words than the latter.
In other words, the desired transformation is to target the lines
consisting of two groups of words separated by a (space-surrounded)
slash character. The first group of words (preceding the slash) on
a target line has 1 to 3 words, but always fewer than the second
group.
Thus, the target lines have the following structure:
‹G1› / ‹G2› ‹G3›
where ‹G1› and
‹G2› ‹G3› (i.e.,
‹G2› concatenated with ‹G3›)
constitute the two aforementioned groups of words, with
‹G2› standing for as many of the leading words of the
after-slash group as there are in the before-slash one, and
‹G3› standing for the remaining words in the
after-slash group.
Such lines should be replaced with two lines, as follows:
‹G1› ‹G3›
‹G2› ‹G3›
For the above example, the desired result is as follows:
Word1
Word2
...
Word4 Word6
Word5 Word6
Word7
Word8 Word9 Word10 Word14
Word11 Word12 Word13 Word14
Word15
Word16
...
Could you please help me implement this transformation in Vim?
You can write a function to expand slash:
fun! ExpandSlash() range
  for i in range(a:firstline, a:lastline)
    let ws = split(getline(i))
    let idx = index(ws, '/')
    if idx == -1
      continue
    endif
    let h = join(ws[ : idx-1])
    let m = join(ws[idx+1 : 2*idx])
    let t = join(ws[2*idx+1 : ])
    call setline(i, h.' '.t.'/'.m.' '.t)
  endfor
endfun
:%call ExpandSlash()
:%s#/#\r#
before
1 2 3 / 4 5 6 7 8
after
1 2 3 7 8
4 5 6 7 8
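For readers who find the slicing easier to follow outside Vim script, the same head/middle/tail logic can be sketched in Python. This is not a Vim solution, only an illustration of the index arithmetic; the function name expand_slash is an assumption, and it returns the two resulting lines as a list instead of editing a buffer.

```python
def expand_slash(line):
    """Split 'G1 / G2 G3' into ['G1 G3', 'G2 G3']; other lines come back unchanged."""
    words = line.split()
    if "/" not in words:
        return [line]
    idx = words.index("/")                 # number of words before the slash
    head = words[:idx]                     # G1
    middle = words[idx + 1:2 * idx + 1]    # G2: as many words as G1
    tail = words[2 * idx + 1:]             # G3: the remaining words
    return [" ".join(head + tail), " ".join(middle + tail)]

for line in ["Word7", "Word4 / Word5 Word6", "1 2 3 / 4 5 6 7 8"]:
    print(expand_slash(line))
```

Vim list slices include the end index while Python slices exclude it, which is why ws[idx+1 : 2*idx] becomes words[idx + 1:2 * idx + 1] here.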
One can use the following command to perform the desired transformation:
:g~/~s~\s*/\s*~\r~|-|exe's/\ze\n\%(\s*\w\+\)\{'.len(split(getline('.'))).'}\(.*\)$/\1'
This :global command selects the lines matching the pattern /
(here, it is delimited by ~ characters) and executes the commands
that follow it for each of those lines.
Let us consider them one by one.
The slash character with optional surrounding whitespace that
separates the first and the second groups of words on the
current line (as defined in the question’s statement), is
replaced by the newline character:
:s~\s*/\s*~\r~
Here the tilde characters are used again to delimit the
pattern and the replacement strings, so that there is no
need to escape the slash.
After the above substitution, the cursor is located on the line
following the one where the substituted slash was. To make writing
the following commands more convenient, the cursor is moved back
to the line just above:
:-
The - address is shorthand for the .-1 range denoting
the line preceding the current one (see :help :range).
The third group of words, which is now at the end of the next
line, is to be appended to the current one. In order to do
that, the number of words in the first group is determined.
Since the current line contains the first group only, that
number can be calculated by separating the contents of that
line into whitespace-delimited groups with the help of the
split() function:
len(split(getline('.')))
The getline('.') call returns the current line as a string,
split() converts that string into a list of words, and
len() counts the number of items in that list.
Using the number of words, a substitution command is generated
and run with the :execute command:
:exe's/\ze\n\%(\s*\w\+\)\{'.len(split(getline('.'))).'}\(.*\)$/\1'
The generated substitution has the following structure:
:s/\ze\n\%(\s*\w\+\)\{N}\(.*\)$/\1
where N is the number of words that were placed before
the slash.
The pattern matches the newline character of the current line
followed by exactly N words on the second line. A word
is matched as a sequence of whitespace preceding a series of
one or more word characters (see :help /\s and :help /\w).
The word pattern is enclosed between the \%( and \)
escaped parentheses (see :help /\%() to treat it as a single
atom for the \{N} specifier (see :help /\{) to match
exactly N occurrences of it. The remaining text to the
end of the next line is matched as a subgroup to be referenced
from the replacement expression.
Because of the \ze atom at the very beginning of the
pattern, its match has zero width (see :help /\ze). Thanks
to that, the substitution command replaces the empty string
just before the newline character with the text matched by the
subgroup, thus inserting the third group of words after the
first one.
For the given example the result is equivalent to replacing each / with the last word on the line and a line break \r. Here is a global substitute command to do it:
:%s#/ \ze.*\(\<\w\+$\)#\1\r#
Explanation:
/ \ze match the / and the space after it, then stop matching (nothing after the \ze will be substituted)
.* match any intermediate characters
\( start another match group
\<\w\+$ match the last word before the end of the line
\) stop the match group
However, you then say that the trailing group g3 may contain more than one word, which means the replace operation needs to be able to count the number of words before and after the /. I'm afraid I don't know how to do that, but I'm sure someone will leap to your rescue before long!

Resources