Match any character (including whitespace) until the LAST bunch of whitespaces - node.js

I've got such text:
0000 10 [STUFF] Text ("TOTAL,SOME RANDOM TEXT") (558b6a68)
The first two columns are fairly static. The third is optional. The last is optional and, if it exists, is always enclosed in parentheses.
My issue is with the fourth column, which can contain spaces or actually any character (except newline, of course).
My current regex looks like this:
^([a-fA-F0-9]{4,})\s+[a-fA-F0-9]+\s+(?:\[[^\]]*\]\s+)?
It matches all until the beginning of the fourth column.
Please note that space might exist anywhere, I can't define exact locations, like "always before parenthesis" or "may be between quotation marks".
I know for sure that this is the column before the last. So I'd like to capture them like this:
0000 10 [STUFF] Text("TOTAL,SOME RANDOM TEXT") (558b6a68)
^  ^            ^                            ^ ^        ^
CAPTURE         C A P T U R E                  C A P T U R E
I'd like to capture the texts marked between ^ ^ characters mentioned in the previous code block.
So, I'd like to grab any character UNTIL the last bunch of whitespace, but without including that whitespace in the final match group.
I hope I described it well :) Is it possible with regex at all?
Here is some more sample text to test on:
0000 10 Text("TOTAL,SOME RANDOM TEXT") (1122aabb)
0010 5 D==1122aabb (1122aabb)
0015 17 Text("AND,SOME,MORE") (00000001)
002c 5 D==1 (1)
0031 1 !D (ccdd3344)
0032 5 D==ccdd3344 (ccdd3344)
0037 2 !1 (1)
0039 0 [AAAA] Fff
0039 1 [BBBB] Aaa
003a 6 N(05, eeff5566) (eeff5566)
0040 1 Qq
0041 2 $ab ([String]:"Unknown")
0043 f Call A/SomeFunc-X
0052 1 cd

I'd also start similarly to your pattern, with something like ^(\w+) +\w+ +(?:\[[^\]]+\] *)?
From here (the start of the fourth column), capture the first \S non-whitespace character followed by .*?, which lazily matches any amount of any character until an optional parenthesized part at the $ end of the line can be captured. If there is no such part, the rest of the line is consumed by group two.
^(\w+) +\w+ +(?:\[[^\]]+\] *)?(\S.*?)(?: +(\([^)]+\)))?$
See this demo at regex101
Feel free to adjust the parentheses of the third group to capture only what's inside, if needed.
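The question is about node.js, but the pattern uses only syntax that Python's re module shares, so a quick way to check it against the sample lines could look like this. It is only a sketch: the names LINE_RE and parse_line are made up for illustration.

```python
import re

# Pattern from the answer above:
#   group 1 = first column, group 2 = fourth column,
#   group 3 = optional trailing "(...)" column (None when absent).
LINE_RE = re.compile(r'^(\w+) +\w+ +(?:\[[^\]]+\] *)?(\S.*?)(?: +(\([^)]+\)))?$')


def parse_line(line):
    """Return (first, fourth, last) for a matching line, else None."""
    m = LINE_RE.match(line)
    return m.groups() if m else None


samples = [
    '0000 10 [STUFF] Text("TOTAL,SOME RANDOM TEXT") (558b6a68)',
    '0039 0 [AAAA] Fff',
    '003a 6 N(05, eeff5566) (eeff5566)',
    '0043 f Call A/SomeFunc-X',
]
for sample in samples:
    print(parse_line(sample))
```

Lines without a trailing parenthesized column give None for the third group, so the caller can tell "missing" apart from "empty".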


Looking for a regex that finds all number combinations without 3 zeros in a row, mixed with delimiters

I would like to find all the number combinations that do not contain 3 zeros in a row.
There might be some delimiters (max 2 characters) in between the digits.
I'm using Python and I would like to perform this search with a regex.
Accepted numbers
This is number 1234 which should be accepted.
12-45
1 2 0 0 3 4 5
not accepted numbers:
1
12
123
1000
1000-2000
30000-31000
21 000-32 000-50 000
21 00 03 00 00
The regex with which I could come up is:
([\s\-]{0,2}\d(?!000)){4,}
My regex finds all the accepted numbers, but it doesn't filter out all the numbers that should be rejected.
See the results in regex
Actually this regex is used in python to remove the matched numbers from the text:
See python code
P.S. Delimiters are not only spaces; they should include at least \s and the dash.
P.P.S. The numbers might be in the middle of the string, so I think I cannot use ^ and $ in my regex.
You could assert not 3 zeroes in a row while matching optional delimiters in between.
\b(?![\d\s-]*?0(?:[\s-]*0){2})\d(?:[\s-]*\d){3,}\b
Explanation
\b A word boundary
(?! Negative lookahead, assert what is at the right is not
[\d\s-]*? Match any of a digit, whitespace char or - as few times as possible
0(?:[\s-]*0){2} Match a zero followed by 2 more zeros, with optional delimiters in between
) Close the lookahead
\d Match a digit
(?:[\s-]*\d){3,} Repeat 3 or more times matching a digit with optional delimiters in between
\b A word boundary
Regex demo
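Since the question mentions Python, here is a small sketch that runs the pattern against the accepted and rejected examples listed in the question. The helper names find_numbers and remove_numbers are illustrative only and do not come from the original post.

```python
import re

# Pattern from the answer above: at least four digits with optional
# whitespace/dash delimiters, rejected if three zeros occur in a row
# (delimiters between the zeros are ignored).
NUMBER_RE = re.compile(r'\b(?![\d\s-]*?0(?:[\s-]*0){2})\d(?:[\s-]*\d){3,}\b')


def find_numbers(text):
    """Return every accepted number found in text."""
    return NUMBER_RE.findall(text)


def remove_numbers(text):
    """Return text with every accepted number removed."""
    return NUMBER_RE.sub('', text)


accepted = ["This is number 1234 which should be accepted.", "12-45", "1 2 0 0 3 4 5"]
rejected = ["1", "12", "123", "1000", "1000-2000", "30000-31000",
            "21 000-32 000-50 000", "21 00 03 00 00"]

for text in accepted + rejected:
    print(repr(text), "->", find_numbers(text))
```

Each example is tested as its own string because \s also matches newlines, so joining the examples into one multi-line text would let a match run across lines.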

Remove entries based on the number of values in the last column

If the last column consists of fewer than 2 values, the whole row should be removed.
sample data:
18106|1.0.4.0/22|223 121 1836
3549|1.0.10.0/24|421 21
5413|1.0.0.0/16|789
2152|1.4.0.0/16|745 89 1876
3549|1.0.8.0/22|680
Expected output:
18106|1.0.4.0/22|223 121 1836
3549|1.0.10.0/24|421 21
2152|1.4.0.0/16|745 89 1876
Is there any way to do it?
If there are no spaces after the single value, you can just eliminate lines with no space after the last |:
grep -v '|[^ |]*$'
[...] is a character class. [ |] matches a space or |.
^ inside a character class negates it, i.e. [^ |] matches anything but space or |.
* means "repeated zero or more times"
$ matches the end of line
-v shows the lines not matching the pattern
So the whole thing means "skip lines that contain vertical bar followed by characters different to space and vertical bar till the end of the line"
It doesn't work for your sample data, though, as there's a space after 789. So, check there's a space followed by non-space after the last |:
grep '|[^ |]\+ [^ |]\+'
Here, \+ means "repeated one or more times".
Short awk solution:
awk -F'\\|' 'split($NF,a," ")>=2' file
The output:
18106|1.0.4.0/22|223 121 1836
3549|1.0.10.0/24|421 21
2152|1.4.0.0/16|745 89 1876
split($NF,a," ") - splits the last field on whitespace and returns the number of chunks; awk prints the line only when that number is at least 2
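For comparison, the same filter can be sketched in Python: take everything after the last |, split it on whitespace, and keep the row when at least two values remain. The function name keep_rows and its min_values parameter are made up for this illustration.

```python
def keep_rows(lines, min_values=2):
    """Keep rows whose last |-separated field holds at least min_values values."""
    kept = []
    for line in lines:
        # Everything after the last "|" is the final column.
        last_field = line.rstrip("\n").rsplit("|", 1)[-1]
        # str.split() with no argument splits on runs of whitespace and
        # ignores leading/trailing whitespace, like awk's split(..., " ").
        if len(last_field.split()) >= min_values:
            kept.append(line)
    return kept


rows = [
    "18106|1.0.4.0/22|223 121 1836",
    "3549|1.0.10.0/24|421 21",
    "5413|1.0.0.0/16|789",
    "2152|1.4.0.0/16|745 89 1876",
    "3549|1.0.8.0/22|680",
]
for row in keep_rows(rows):
    print(row)
```

Because the values are counted rather than matched by a pattern, a trailing space after a single value does not cause a row to be kept.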

Linux Shell - Grep command

I'm having a problem using grep with these options: \{n\}, \{n,\} and \{n,m\}. I have a file named "new" with these lines:
aaaa
aaa
aa
a
When I use grep 'a\{1\}' new I get this output:
aaaa
aaa
aa
a
So, basically, this command will show me the lines that include 1 or more consecutive occurrences of the character "a", right?
Also, will grep 'a\{1,\}' new do the same thing as grep 'a\{1\}' new? I get the same output for both.
As for the last one, \{n,m\}, I can't really understand what it does.
I would really appreciate it if anyone could help me out.
From man grep:
Repetition
A regular expression may be followed by one of several repetition
operators:
? The preceding item is optional and matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{n,m} The preceding item is matched at least n times, but not more
than m times.
For example, grep 'a\{2,3\}' new also matches the line aaaa because of its first three (or two) a characters. The rest of the line doesn't matter.
If you really want only runs of 2 or 3 consecutive a to be matched, you could use the -o flag. But be aware that this would output aaa and aa from a line with aaaaa. To avoid this you have to add more context, such as the line anchors ^ and $ used in the example below.
By the way, I would suggest using the -E flag (or egrep, which is the same) so you have extended regex support. You don't have to escape the braces then.
For input
aaaaa
aaaa
aaa
aa
a
a call of grep -o -E '^a{2,3}$' will give the output:
aaa
aa
grep 'a\{n,m\}' new matches at least n and at most m consecutive a characters; since the pattern is not anchored, any line of new containing such a run is printed.
For example, grep 'a\{2,3\}' new will output
aaaa
aaa
aa
The last line does not match because it only has ONE a.
For 'a\{n,\}', omitting m means any number greater than or equal to n.
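The difference between an unanchored and an anchored interval can also be tried out in Python, whose re module writes the braces unescaped, like grep -E. This is only an illustrative sketch and not a replacement for the grep commands above: re.search plays the role of plain grep, re.fullmatch the role of '^a{2,3}$', and re.findall roughly that of grep -o.

```python
import re

lines = ["aaaaa", "aaaa", "aaa", "aa", "a"]

# Unanchored: a line qualifies as soon as it contains a run of 2 or 3 a's,
# so longer runs qualify too (like grep 'a\{2,3\}').
unanchored = [line for line in lines if re.search(r"a{2,3}", line)]

# Anchored: the whole line must consist of exactly 2 or 3 a's
# (like grep -E '^a{2,3}$').
anchored = [line for line in lines if re.fullmatch(r"a{2,3}", line)]

# Non-overlapping matches inside one line (roughly what grep -o prints).
pieces = re.findall(r"a{2,3}", "aaaaa")

print(unanchored)  # ['aaaaa', 'aaaa', 'aaa', 'aa']
print(anchored)    # ['aaa', 'aa']
print(pieces)      # ['aaa', 'aa']
```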

How to search/replace special chars?

After a copy-paste from Wikipedia into Vim, I get this:
1 A
2
3 [+] Métier agricole<200e> – 44 P • 2 C
4 [×] Métier de l'ameublement<200e> – 10 P
5 [×] Métier de l'animation<200e> – 5 P
6 [+] Métier en rapport avec l'art<200e> – 11 P • 4 C
7 [×] Métier en rapport avec l'automobile<200e> – 10 P
8 [×] Métier de l'aéronautique<200e> – 15 P
The problem is that <200e> is actually a single character.
I'd like to know how to use it in a search/replace (via / or :).
Check the help for \%u:
/\%d /\%x /\%o /\%u /\%U E678
\%d123 Matches the character specified with a decimal number. Must be
followed by a non-digit.
\%o40 Matches the character specified with an octal number up to 0377.
Numbers below 040 must be followed by a non-octal digit or a non-digit.
\%x2a Matches the character specified with up to two hexadecimal characters.
\%u20AC Matches the character specified with up to four hexadecimal
characters.
\%U1234abcd Matches the character specified with up to eight hexadecimal
characters.
These are sequences you can use. <200e> is how Vim displays the single code
point U+200E (LEFT-TO-RIGHT MARK), an invisible formatting character, so
\%u200e should match it.
replace ^#
:%s/\%x00//g
replace ^L
// Enter the ^L using ctrl-V ctrl-L
:%s/^L//g
References:
gvim - How to remove this symbol "^#" with vim? - Super User
vim - Deleting form feed ^L characters - Stack Overflow
If you want to quickly select this extraneous character everywhere and replace it / get rid of it, you could:
isolate one of the strange characters by adding a space before and after it, so it becomes a "word"
use the * command to search for the word under the cursor. If you have set hlsearch on, you should then see all of the occurrences of the extraneous character highlighted.
replace last searched item by something else, globally:
:%s//something else/
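Outside Vim, the same cleanup can be sketched in a few lines of Python, which also confirms that <200e> is one code point rather than two separate bytes. The sample string below is rebuilt from the pasted text in the question, and the variable names are illustrative only.

```python
import unicodedata

LRM = "\u200e"  # the character Vim displays as <200e>

# One line of the pasted text, with the invisible mark written explicitly.
line = "[+] Métier agricole" + LRM + " – 44 P • 2 C"

# A single code point, although it takes three bytes in UTF-8.
print(unicodedata.name(LRM))     # LEFT-TO-RIGHT MARK
print(len(LRM.encode("utf-8")))  # 3

# Same effect as :%s/\%u200e//g in Vim.
cleaned = line.replace(LRM, "")
print(cleaned)
```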

Matching only a <tab> that is between two numbers

How to match a tab only when it is between two numbers?
Sample script
209.65834 27.23204908
119.37987 15.03317082
74.240635 8.30561924
29.1014 0
931.8861 -100.00000
-16.03784 -8.30562
;
_mirror
l
;
29.1014 0
1028.10 0.00
n
_spline
935.4875 250
924.2026913 269.8820375
912.9178825 277.4506484
890.348265 287.3181854
(in the above script, the tabs are between the numbers, not the spaces) (blank lines are significant; there is nothing in them, but I can't lose them)
I wish to get a "," between the numbers. Tried with :%s/\t/\,/ but that will touch the empty lines too, and the end of lines.
Try this:
:%s/\(\d\)\t\(-\?\d\)/\1,\2/
\d matches any digit. -\? means "an optional -". The pairs of (escaped) parentheses capture the match; \1 refers to the first captured match and \2 to the second.
google://vim+regex -> http://vimregex.com/ ->
:%s/\([0-9]\)\t\([0-9]\)/\1,\2/gc
You have 2 groups of digits here, ([0-9]), with a tab symbol \t between them. Add some escape symbols and you have the answer.
g replaces every match on a line, c asks for confirmation before each replacement.
\1 and \2 are matching groups (numbers in your case).
It's not really hard to find answer for questions like that by yourself.
try
:%s/\([0-9]\)\t\([0-9]\)/\1,\2/g
explanation - search for the pattern <digit>\t<digit> and remember the parts that match <digit>.
\( ... \) captures and remembers the part that matches.
\1 recalls the first captured digit, \2 the second captured digit.
so if the match was on 123\t789, <digit>\t<digit> matches 3\t7
the 3 and 7 are remembered as \1 and \2
or
:g/[0-9]/ s/\t/,/g
explanation - filter all lines with a digit, then substitute tabs with a comma on those lines
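The capture-group idea carries over to other regex engines almost unchanged. As a rough sketch in Python (tabs_to_commas is a made-up name), the version below keeps the optional minus sign from the first answer, since the sample contains negative numbers, and assumes two numbers per line as in the sample.

```python
import re

# A digit, a tab, then a digit with an optional leading minus sign.
TAB_BETWEEN_NUMBERS = re.compile(r"(\d)\t(-?\d)")


def tabs_to_commas(text):
    """Replace a tab with a comma only where it sits between two numbers."""
    return TAB_BETWEEN_NUMBERS.sub(r"\1,\2", text)


script = "931.8861\t-100.00000\n\t\n_mirror\n29.1014\t0\n1028.10\t0.00\t\n"
print(tabs_to_commas(script))
```

Tabs on otherwise empty lines and at the end of a line are left alone, because they have no digit on both sides.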
