A complicated case of conditional line splitting to be performed in Vim - vim

Here is a sample of text that I’m working with:
Word1
Word2
...
Word4 / Word5 Word6
Word7
Word8 Word9 Word10 / Word11 Word12 Word13 Word14
Word15
Word16
...
I would like to transform it by splitting the lines containing
slash-separated chunks, so that the first chunk (preceding the slash)
gets the trailing words copied from the second chunk (following the
slash) to equalize the number of words in both lines resulting from
the chunks, if the former one has fewer words than the latter.
In other words, the desired transformation is to target the lines
consisting of two groups of words separated by a (space-surrounded)
slash character. The first group of words (preceding the slash) on
a target line has 1 to 3 words, but always fewer than the second
group.
Thus, the target lines have the following structure:
‹G1› / ‹G2› ‹G3›
where ‹G1› and
‹G2› ‹G3› (i.e.,
‹G2› concatenated with ‹G3›)
constitute the two aforementioned groups of words, with
‹G2› standing for as many of the leading words of the
after-slash group as there are in the before-slash one, and
‹G3› standing for the remaining words in the
after-slash group.
Such lines should be replaced with two lines, as follows:
‹G1› ‹G3›
‹G2› ‹G3›
For the above example, the desired result is as follows:
Word1
Word2
...
Word4 Word6
Word5 Word6
Word7
Word8 Word9 Word10 Word14
Word11 Word12 Word13 Word14
Word15
Word16
...
Could you please help me implement this transformation in Vim?

You can write a function to expand slash:
fun! ExpandSlash() range
  for i in range(a:firstline, a:lastline)
    let ws = split(getline(i))
    let idx = index(ws, '/')
    if idx==-1
      continue
    endif
    let h= join(ws[ : idx-1])
    let m= join(ws[idx+1 : 2*idx])
    let t= join(ws[2*idx+1 : ])
    call setline(i, h.' '.t.'/'.m.' '.t)
  endfor
endfun
:%call ExpandSlash()
:%s#/#\r#
before
1 2 3 / 4 5 6 7 8
after
1 2 3 7 8
4 5 6 7 8

One can use the following command to perform the desired transformation:
:g~/~s~\s*/\s*~\r~|-|exe's/\ze\n\%(\s*\w\+\)\{'.len(split(getline('.'))).'}\(.*\)$/\1'
This :global command selects the lines matching the pattern /
(here, it is delimited by ~ characters) and executes the commands
that follow it for each of those lines.
Let us consider them one by one.
The slash character with optional surrounding whitespace that
separates the first and the second groups of words on the
current line (as defined in the question’s statement), is
replaced by the newline character:
:s~\s*/\s*~\r~
Here the tilde characters are used again to delimit the
pattern and the replacement strings, so that there is no
need to escape the slash.
After the above substitution the cursor is located on the line
next to the one where the substituted slash was. To make writing
the following commands more convenient, the cursor is moved back
to the line just above:
:-
The - address is shorthand for the .-1 range denoting
the line preceding the current one (see :help :range).
The third group of words, which is now at the end of the next
line, is to be appended to the current one. In order to do
that, the number of words in the first group is determined.
Since the current line contains the first group only, that
number can be calculated by separating the contents of that
line into whitespace-delimited groups with the help of the
split() function:
len(split(getline('.')))
The getline('.') call returns the current line as a string,
split() converts that string into a list of words, and
len() counts the number of items in that list.
Using the number of words, a substitution command is generated
and run with the :execute command:
:exe's/\ze\n\%(\s*\w\+\)\{'.len(split(getline('.'))).'}\(.*\)$/\1'
The substitutions have the following structure:
:s/\ze\n\%(\s*\w\+\)\{N}\(.*\)$/\1
where N is the number of words that were placed before
the slash.
The pattern matches the newline character of the current line
followed by exactly N words on the second line. A word
is matched as a sequence of whitespace preceding a series of
one or more word characters (see :help /\s and :help /\w).
The word pattern is enclosed between the \%( and \)
escaped parentheses (see :help /\%() to treat it as a single
atom for the \{N} specifier (see :help /\{) to match
exactly N occurrences of it. The remaining text to the
end of the next line is matched as a subgroup to be referenced
from the replacement expression.
Because of the \ze atom at the very beginning of the
pattern, its match has zero width (see :help /\ze). Thanks
to that, the substitution command replaces the empty string
just before the newline character with the text matched by the
subgroup, thus inserting the third group of words after the
first one.
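As a cross-check of the expected output outside Vim, the same transformation can be sketched in Python. This is only an illustration under the question's assumptions (a space-surrounded slash, and fewer words before it than after it); the function name expand_slash is made up for this example and is not part of either Vim answer.

```python
def expand_slash(line):
    """Split 'G1 / G2 G3' into ['G1 G3', 'G2 G3']; return other lines unchanged."""
    words = line.split()
    if "/" not in words:
        return [line]
    n = words.index("/")              # number of words before the slash
    g1 = words[:n]                    # first group
    g2 = words[n + 1 : 2 * n + 1]     # as many words after the slash as before it
    g3 = words[2 * n + 1 :]           # remaining trailing words, shared by both lines
    return [" ".join(g1 + g3), " ".join(g2 + g3)]


sample = [
    "Word1",
    "Word4 / Word5 Word6",
    "Word8 Word9 Word10 / Word11 Word12 Word13 Word14",
]
result = [out for line in sample for out in expand_slash(line)]
print("\n".join(result))
```

Comparing this output with the buffer after running the :global command is a quick way to confirm that the Vim command did what was intended.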

For the given example the result is equivalent to replacing each / with the last word on the line and a line break \r. Here is a global substitute command to do it:
:%s#/ \ze.*\(\<\w\+$\)#\1\r#
Explanation:
/ \ze match the / and stop matching (nothing after the \ze will be substituted)
.* match any intermediate characters
\( start another match group
\<\w\+$ match the last word before the end of the line
\) stop the match group
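For readers more familiar with other regex flavours, here is a rough Python analogue of that substitution. It is a sketch, not the Vim command itself: Python has no \ze, so a lookahead plays the same role of capturing the last word without consuming it. The helper name copy_last_word is hypothetical.

```python
import re


def copy_last_word(line):
    # "/ " is consumed and replaced; the lookahead only captures the last
    # word on the line, so the text after the slash stays in place.
    return re.sub(r"/ (?=.*\b(\w+)$)", r"\1\n", line)


print(copy_last_word("Word4 / Word5 Word6"))
print(copy_last_word("Word8 Word9 Word10 / Word11 Word12 Word13 Word14"))
```

Like the Vim version, this copies only one trailing word, so it covers the sample but not a trailing group of several words.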
However, you then say that the trailing group g3 may contain more than one word, which means the replace operation needs to be able to count the number of words before and after the /. I'm afraid I don't know how to do that, but I'm sure someone will leap to your rescue before long!

Related

Regular expression to extract text between multiple hyphens

I have a string in the following format
----------some text-------------
How do I extract "some text" without the hyphens?
I have tried with this but it matches the whole string
(?<=-).*(?=-)
The pattern matches the whole line except the first and last hyphen due to the assertions on the left and right and the . also matches a hyphen.
You can keep using the assertions, and match any char except a hyphen using [^-]+ which is a negated character class.
(?<=-)[^-]+(?=-)
Note: if you also want to prevent matching a newline you can use (?<=-)[^-\r\n]+(?=-)
With your shown samples please try following regex.
^-+([^-]*)-+$
Explanation: match one or more dashes at the start, then capture everything up to the next - in the one and only capturing group, then match the remaining dashes through the end of the value.
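A small Python sketch, using the sample string from the question, shows how the three patterns discussed so far behave. The variable names are chosen only for this illustration.

```python
import re

s = "----------some text-------------"

# The original attempt: '.' also matches hyphens, so only the outermost
# hyphen on each side is excluded.
greedy = re.search(r"(?<=-).*(?=-)", s).group()

# Negated character class between the same assertions.
negated = re.search(r"(?<=-)[^-]+(?=-)", s).group()

# Anchored pattern with a single capturing group.
anchored = re.search(r"^-+([^-]*)-+$", s).group(1)

print(repr(greedy))
print(repr(negated))
print(repr(anchored))
```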
You are using Python, right? Then you do not need regex.
Use
s = "----------some text-------------"
print(s.strip("-"))
Results: some text.
With regex, the same can be achieved using
re.sub(r'^-+|-+$', '', s)
EXPLANATION
^    the beginning of the string
-+   '-' (1 or more times (matching the most amount possible))
|    OR
-+   '-' (1 or more times (matching the most amount possible))
$    before an optional \n, and the end of the string

Regular expression to capture n lines of text between two regex patterns

Need help with a regular expression to grab exactly n lines of text between two regex matches. For example, I need 17 lines of text and I used the example below, which does not work.
Please see sample code below:
import re
match_string = re.search(r'^.*MDC_IDC_RAW_MARKER((.*?\r?\n){17})Stored_EGM_Trigger.*\n', t, re.DOTALL).group()
value1 = re.search(r'value="(\d+)"', match_string).group(1)
value2 = re.search(r'value="(\d+\.\d+)"', match_string).group(1)
print(match_string)
print(value1)
print(value2)
I added a sample string to here, because SO does not allow long code string:
https://hastebin.com/aqowusijuc.xml
You are getting false positives because you are using the re.DOTALL flag, which allows the . character to match newline characters. That is, when matching ((.*?\r?\n){17}), a single .*? can run across several line breaks just to satisfy the required count of 17, so the match may span more than 17 lines. The \r? is also superfluous. Starting the regex with ^.* is superfluous as well: it anchors the search at the beginning and then tells the engine to skip as many characters as necessary to reach MDC_IDC_RAW_MARKER, which re.search does anyway. So, a simplified and correct regex would be:
match_string = re.search(r'MDC_IDC_RAW_MARKER.*\n((.*\n){17})Stored_EGM_Trigger.*\n', t)
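The effect of re.DOTALL can be seen on a much smaller, made-up input. The sketch below assumes n = 3 instead of 17 and uses invented lines between the two markers; the helper capture_lines is hypothetical and only parameterizes the simplified regex above.

```python
import re


def capture_lines(text, n, flags=0):
    # Same shape as the simplified regex, with the line count as a parameter.
    pattern = r"MDC_IDC_RAW_MARKER.*\n((?:.*\n){%d})Stored_EGM_Trigger" % n
    return re.search(pattern, text, flags)


# Invented sample data: exactly three lines, then four lines, between markers.
exact = "MDC_IDC_RAW_MARKER\none\ntwo\nthree\nStored_EGM_Trigger\n"
too_many = "MDC_IDC_RAW_MARKER\none\ntwo\nthree\nfour\nStored_EGM_Trigger\n"

print(capture_lines(exact, 3).group(1).splitlines())
print(capture_lines(too_many, 3))                         # no match without DOTALL
print(capture_lines(too_many, 3, re.DOTALL) is not None)  # DOTALL lets '.' cross lines
```

Without the flag, each .*\n is confined to a single line, so the count is enforced; with the flag, the four-line input still matches.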

Why doesn't this RegEx match anything?

I've been trying for about two hours now to write a regular expression which matches a single character that's not preceded or followed by the same character.
This is what I've got: (\d)(?<!\1)\1(?!\1); but it doesn't seem to work! (testing at https://regex101.com/r/whnj5M/6)
For example:
In 1111223 I would expect to match the 3 at the end, since it's not preceded or followed by another 3.
In 1151223 I would expect to match the 5 in the middle, and the 3 at the end for the same reasons as above.
The end goal for this is to be able to find pairs (and only pairs) of characters in strings (e.g. to find 11 in 112223 or 44 in 123544) and I was going to try and match single isolated characters, and then add a {2} to it to find pairs, but I can't even seem to get isolated characters to match!
Any help would be much appreciated, I thought I knew RegEx pretty well!
P.S. I'm testing in JS on regex101.com because it wouldn't let me use variable length lookbacks in Python on there, and I'm using the regex library to allow for this in my actual implementation.
Your regex is close, but by using simply (\d) you are consuming characters, which prevents the other match from occurring. Instead, you can use a positive lookahead to set the capture group and then test for any occurrences of the captured digit not being surrounded by copies of itself:
(?=.*?(.))(?<!\1)\1(?!\1)
By using a lookahead you avoid consuming any characters and so the regex can match anywhere in the string.
Note that in 1151223 this returns 5, 1 and 3 because the third 1 is not adjacent to any other 1s.
Demo on regex101 (requires JS that supports variable width lookbehinds)
The pattern you tried does not match because this part (\d)(?<!\1) can not match.
It reads as:
Capture a digit in group 1. Then, on the position after that captured
digit, assert what is captured should not be on the left.
You could make the pattern work by adding for example a dot after the backreference (?<!\1.) to assert that the value before what you have just matched is not the same as group 1
Pattern
(\d)(?<!\1.)\1(?!\1)
Note that you have selected ECMAscript on regex101.
Python re does not support variable width lookbehind.
To make this work in Python, you need the PyPi regex module.
Example code
import regex
pattern = r"(\d)(?<!\1.)\1(?!\1)"
test_str = ("1111223\n"
"1151223\n\n"
"112223\n"
"123544")
matches = regex.finditer(pattern, test_str)
for matchNum, match in enumerate(matches, start=1):
print(match.group())
Output
22
11
22
11
44
@Theforthbird has provided a good explanation for why your regular expression does not match the characters of interest.
Each character matched by the following regular expression is neither preceded nor followed by the same character (including characters at the beginning and end of the string).
r'^.$|^(.)(?!\1)|(?<=(.))(?!\2)(.)(?!\3)'
Python's re regex engine performs the following operations.
^.$ match the first char if it is the only char in the line
| or
^ match beginning of line
(.) match a char in capture group 1...
(?!\1) ...that is not followed by the same character
| or
(?<=(.)) save the previous char in capture group 2...
(?!\2) ...that is not equal to the next char
(.) match a character and save to capture group 3...
(?!\3) ...that is not equal to the following char
Suppose the string were "cat".
The internal string pointer is initially at the beginning of the line.
"c" is not at the end of the line so the first part of the alternation fails and the second part is considered.
"c" is matched and saved to capture group 1.
The negative lookahead asserting that "c" is not followed by the content of capture group 1 succeeds, so "c" is matched and the internal string pointer is advanced to a position between "c" and "a".
"a" fails the first two parts of the alternation so the third part is considered.
The positive lookbehind (?<=(.)) saves the preceding character ("c") in capture group 2.
The negative lookahead (?!\2), which asserts that the next character ("a") is not equal to the content of capture group 2, succeeds. The string pointer remains just before "a".
The next character ("a") is matched and saved in capture group 3.
The negative lookahead (?!\3), which asserts that the following character ("t") does not equal the content of capture group 3, succeeds, so "a" is matched and the string pointer advances to just before "t".
The same steps are performed when evaluating "t" as were performed when evaluating "a". Here the last token ((?!\3)) succeeds, however, because no characters follow "t".
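When checking any of the patterns in this thread, it can help to have a regex-free reference to compare against. The sketch below is not one of the regex solutions; it swaps in itertools.groupby to list runs of an exact length, so n = 1 gives the isolated characters and n = 2 gives the pairs that are the question's end goal. The function name runs_of_length is made up for this example.

```python
from itertools import groupby


def runs_of_length(s, n):
    """Return each maximal run of identical characters whose length is exactly n."""
    return [ch * n for ch, run in groupby(s) if len(list(run)) == n]


for s in ["1111223", "1151223", "112223", "123544"]:
    print(s, "isolated:", runs_of_length(s, 1), "pairs:", runs_of_length(s, 2))
```

For 1151223 this reports 5, 1 and 3 as isolated, which agrees with the note above that the third 1 is not adjacent to any other 1.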

Custom Syntax Highlighting in Sublime Text 3 -- inline comments indicated by $

I'm trying to write custom syntax file for a code that uses 'C' or 'c' to indicate lines that are comments and '$' to indicate inline comments. Right now I have:
  comments:
    # Comments begin with a 'c' or C and finish at the end of the line.
    - match: '\b(c|C|\\$)\b'
      scope: punctuation.definition.comment.mcinp
      push:
        # This is an anonymous context push for brevity.
        - meta_scope: comment.line.c.mcinp
        - match: $\n?
          pop: true
So:
A line that starts with an upper or lower case 'C' is a comment. Anything after a '$' in a line is a comment:
c this line is a comment
a = 1 $ anything on a line after a dollar sign is a comment
This doesn't change the highlighting of text after a $, so it must be wrong. I'd appreciate any insight on this.
I think your problem here is two-fold.
First, the construct \b matches a word boundary, but the boundary of a word is defined as the following (from this page):
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Word characters don't include whitespace, so in order for the rule to trigger there needs to be a character before and after the $ in order for it to match.
The second issue is that \\$ isn't creating an escaped $ like you think it is, it's the escape character escaping itself (\\) followed by a literal $ that matches the end of the line. As such that regex can never match because it requires the next character after the end of the line to be a word character, which it can't be.
What you probably want is instead \$ to result in a literal $ character.
All combined, the example would look more like this:
    # Comments begin with a 'c' or C and finish at the end of the line.
    - match: '(?:\b[Cc]\b)|\$'
      scope: punctuation.definition.comment.mcinp
      push:
        # This is an anonymous context push for brevity.
        - meta_scope: comment.line.c.mcinp
        - match: $\n?
          pop: true
This moves the $ out of the bounds of the word boundary conditions you've defined, so it will match as appropriate.
As a side note, your question mentions that a line that starts with a C is a comment, but as defined a C anywhere in the line defines a comment, as long as it's a single word.
To get it to behave as your question describes, something like the following is more appropriate, which constrains the match on the C characters to being the first non-whitespace character on the line:
- match: '(?:^\s*[Cc]\b)|\$'
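The two issues can be reproduced outside Sublime Text. The following sketch uses Python's re module as a stand-in, on the assumption that \b, \\$ and \$ behave the same way there as in Sublime's regex engine for these simple patterns; the sample lines come from the question.

```python
import re

comment_line = "c this line is a comment"
inline_line = "a = 1 $ anything on a line after a dollar sign is a comment"

original = re.compile(r"\b(c|C|\\$)\b")   # pattern from the question
fixed = re.compile(r"(?:\b[Cc]\b)|\$")    # pattern from the answer

# '$' sits between two spaces, so there is no word boundary on either side.
print(re.search(r"\b\$\b", inline_line))
# Between two word characters, both boundaries exist.
print(re.search(r"\b\$\b", "x$y") is not None)

print(original.search(inline_line))       # the '$' alternative never matches
print(fixed.search(inline_line).start())  # index of the '$'
print(fixed.search(comment_line).start()) # index of the leading 'c'
```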

How to replace finding words with the different in each occurrence in VI/VIM editor ?

For example, I have the text
10 3 4 2 10 , 4 ,10 ....
Now I want to change each 10 to a different word.
I know %s/10/replace-words/gc, but it only lets me confirm each replacement interactively (yes/no). I want to change each occurrence of 10 to a different word, like replace1, 3, 4 , 2 , replace2, 4, replace3 ....
Replaces each occurrence of 10 with replace{index_of_match}:
:let @a=1 | %s/10/\='replace'.(@a+setreg('a',@a+1))/g
Replaces each occurrence of 10 with a word from a predefined array:
:let b = ['foo', 'bar', 'vim'] | %s/10/\=(remove(b, 0))/g
Replaces each occurrence of 10 with a word from a predefined array, and the index of the match:
:let @a=1 | let b = ['foo', 'bar', 'vim'] | %s/10/\=(b[@a-1]).(@a+setreg('a',@a+1))/g
But since you have to type in every word anyway, the benefit of the second and third commands is minimal. See the answer from SpoonMeiser for the "manual" solution.
Update: As wished, the explanation for the regex part in the second example:
% = on every line in the document
s/<search>/<replace>/g = s means do a search & replace, g means replace every occurrence.
\= interprets the following as code.
remove(b, 0) removes the element at index 0 of the list b and returns it.
So for the first occurrence, the line will be %s/10/foo/g; the second time, the list is only ['bar', 'vim'], so the line will be %s/10/bar/g, and so on.
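The same idea, a replacement that is evaluated once per match, exists in other tools too. As an illustration only, here is a Python sketch in which re.sub takes a function as the replacement, much as \= evaluates an expression in Vim; the sample text is taken from the question and the variable names are invented.

```python
import re
from itertools import count

text = "10 3 4 2 10 , 4 ,10"

# A word from a predefined list for each match (like remove(b, 0) in Vim).
words = ["foo", "bar", "vim"]
from_list = re.sub(r"10", lambda m: words.pop(0), text)

# A running index appended to a fixed prefix (like the @a register counter).
counter = count(1)
numbered = re.sub(r"10", lambda m: "replace%d" % next(counter), text)

print(from_list)
print(numbered)
```

As in the Vim version, the list variant fails once there are more matches than words, so the list has to be at least as long as the number of occurrences.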
Note: This is a quick draft, and unlikely the best & cleanest way to achieve it, if somebody wants to improve it, feel free to add a comment
Is there a pattern to the words you want or would you want to type each word at each occurrence of the word you're replacing?
If I were replacing each instance of "10" with a different word, I'd probably do it somewhat manually:
/10
cw
<type word>ESC
ncw
<type word>ESC
ncw
<type word>ESC
Which doesn't seem too onerous, if each word is different and has to be typed separately anyway.
