^[:blank:] does not match dot in sed - linux

I have an input as follows:
INa.aa................... October 2010 after its previous U.S.-based owners failed to pay debts
My goal is to put brackets around every word starting with letter i/I. So I issued a command:
sed 's/\<i[^[:blank:]]*\>/(&)/gi' input_data
Which returned this output:
(INa.aa)................... October 2010 after (its) previous U.S.-based owners failed to pay debts
What I don't get is, why doesn't the [^[:blank:]]* also include the dots after INa.aa?
Thank you for any suggestions.

You use the \> "end of word" escape. A word boundary is defined as
the character to the left is a "word" character and the character to the right is a "non-word" character, or vice-versa
in the manual (referring to \b). In the case of \>, the "vice-versa" does not apply.
What is a "word" character?
A "word" character is any letter or digit or the underscore character.
And "non-word" characters are all the others. You expect the boundary between your periods and a blank to match \>, but it doesn't: both the period and the blank are non-word characters. The word boundary is between the last a and the first period.
The period between the a's is also surrounded by word boundaries, but [^[:blank:]]* matches as much as it can, so the match runs past that period to the last position where \> still holds, and the period ends up as part of the match.
If you want to match everything up to the next blank, you can just skip the \> in your regex.
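The behaviour can be reproduced outside sed. The following is only a sketch in Python, not the sed implementation: Python's re module has no \< or \>, so the lookarounds START_OF_WORD and END_OF_WORD (names made up for this example) emulate them, and [^ \t] stands in for [^[:blank:]]. Python's \w also covers non-ASCII letters, which does not matter for this input.

```python
import re

line = ("INa.aa................... October 2010 after its previous "
        "U.S.-based owners failed to pay debts")

# Emulations of sed's \< and \> (assumed equivalents, built from lookarounds).
START_OF_WORD = r"(?<!\w)(?=\w)"
END_OF_WORD = r"(?<=\w)(?!\w)"

# With the end-of-word anchor the match must stop right after a word character.
with_end = re.sub(START_OF_WORD + r"i[^ \t]*" + END_OF_WORD,
                  r"(\g<0>)", line, flags=re.IGNORECASE)

# Without it the match simply runs up to the next blank.
without_end = re.sub(START_OF_WORD + r"i[^ \t]*",
                     r"(\g<0>)", line, flags=re.IGNORECASE)

print(with_end)
print(without_end)
```

With the anchor, the closing bracket lands right after INa.aa. Without it, the trailing dots are included in the brackets.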

Related

Mark words in notepad++ including dash (-)

I would like to mark in Notepad++ the sql scripts in a text log. The sql files have this format in the text:
AAAAAAAA.BBBBBBBBBBB.sql
So what I execute is this sentence in search menu:
\w*.sql
With that I should get BBBBBBBBBBB.sql. The point is that some script names contain dashes (-), and when that happens I don't get the whole name, just the part after the last dash.
For example, in:
AAAAAAAA.BBBBB-CCCCCCC.sql
I would like to get BBBBB-CCCCCCC.sql, but I just get CCCCCCC.sql
Is there any possible formula to get them?
If the match can not start and end with a hyphen:
\w+(?:-\w+)*\.sql
\w+ Match 1+ word characters
(?:-\w+)* Optionally match - and 1+ word characters
\.sql Match .sql
See a regex demo.
Note that in your pattern the \w* can also match 0 occurrences and that the . can match any character if it is not escaped.
Another option could be a character class that matches either - or a word character, but this would also allow the two to be mixed freely, matching names like --a--.sql
[\w-]+\.sql
See another regex demo.
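To compare the three patterns side by side, here is a small sketch using Python's re module rather than Notepad++'s own regex engine; the patterns in this thread use only syntax that both support. The helper name first_match is invented for the example.

```python
import re

def first_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    m = re.search(pattern, text)
    return m.group() if m else None

plain = "AAAAAAAA.BBBBBBBBBBB.sql"
dashed = "AAAAAAAA.BBBBB-CCCCCCC.sql"

# Original pattern: stops at the dash because \w does not match "-".
original = first_match(r"\w*.sql", dashed)

# Hyphen-separated groups of word characters, literal ".sql".
strict = first_match(r"\w+(?:-\w+)*\.sql", dashed)

# Character class: also accepts leading, trailing and repeated hyphens.
loose = first_match(r"[\w-]+\.sql", dashed)

print(original, strict, loose)
print(first_match(r"\w+(?:-\w+)*\.sql", "--a--.sql"))
print(first_match(r"[\w-]+\.sql", "--a--.sql"))
```

The last two calls show the difference between the two suggestions: only the character-class version accepts --a--.sql.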

What do you understand by this RegEx?

I'm working with VBA and trying to split a string into three columns. Almost all the strings look like Company Name 3567782 Agent Name.pdf
With this pattern I want to match all the text before a space and digits (1st group), the digits (2nd group) and all the text after the space and before the .pdf (3rd group).
strPattern = "^(.+)\n(\d{4,10})\n(.+).pdf"
I recall that whitespace in Python is \s, but I saw that in VBA it is \n.
Can you help me find the right pattern for what I'm looking for?
As I put in my comment, I use the https://regex101.com site. There are others but I find this one the most helpful to me.
When I put in your regex
^(.+)\n(\d{4,10})\n(.+).pdf
and test string
Company Name 3567782 Agent Name.pdf
the first thing I notice is that the regex does not match the test string (see right side under MATCH INFORMATION).
Here are a couple things that I saw:
\n is newline, not space. In regex, space is " ".
Your last "." in ".pdf" is not registering as a literal period, it's a token that matches any character. To match a literal period, you need \.
If we change those two things it returns three groups that seem to match what you are looking for.
^(.+) (\d{4,10}) (.+)\.pdf
It looks like for the digits, you are looking for between 4 and 10 digits. If that's correct, it looks like your regex is good. You could put in a handful of example strings into the TEST STRING area and make sure that it works in all cases.
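As a quick check away from regex101, the corrected pattern can be tried in Python. This is a sketch with Python's re module, not VBA's RegExp object, so it only confirms the pattern itself; the variable names are made up for the example.

```python
import re

# Corrected pattern: literal spaces instead of \n, escaped dot before "pdf".
pattern = re.compile(r"^(.+) (\d{4,10}) (.+)\.pdf")

match = pattern.match("Company Name 3567782 Agent Name.pdf")
company, number, agent = match.groups()

print(company)
print(number)
print(agent)
```

The three groups map directly onto the three columns the question asks for.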
I'd use either of these:
(?:(?:([a-zA-Z]+\.?)|(\d+)))
This captures runs of letters greedily, with an optional trailing . to allow for the .pdf, or captures runs of digits.
This version excludes the space ([ ] or \s), so each match is a single token.
Or keep the search structured, so you can control what goes into each column:
^(\w+\s\w+)|(\d+)|(\w+\s\w+\.\w+$)
\b or ^ - word boundary or start of string
(\w+\s\w+) - 1st capture: \w+ matches word characters greedily, followed by one whitespace character (use \s* or \s+ to allow more), followed again by word characters matched greedily
|(\d+) - alternation: \d+ captures just the digits
|(\w+\s\w+\.\w+$) - similar to the 1st group, but allows for the '.' of .pdf and is anchored to the end of the string with $.
You could optionally build the '.' into the 1st group as in my first pattern, but for neatness and better control I prefer the 2nd.
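To see what the structured pattern returns, here is a sketch in Python's re module (not VBA). Because the three groups are alternatives, the pattern has to be applied repeatedly, and each match fills exactly one group. Note that it assumes exactly two words on each side of the number and that the third column keeps the .pdf extension.

```python
import re

pattern = re.compile(r"^(\w+\s\w+)|(\d+)|(\w+\s\w+\.\w+$)")
name = "Company Name 3567782 Agent Name.pdf"

# Each match sets one of the three groups; lastindex says which one.
columns = [m.group(m.lastindex) for m in pattern.finditer(name)]
which_group = [m.lastindex for m in pattern.finditer(name)]

print(columns)
print(which_group)
```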

How to change a specific colum content strings using bash/shell?

I'm having a .txt file looking like this (along about 400 rows):
lettuceFMnode_1240 J_C7R5_99354_KNKSR3_Oligomycin 81.52
lettuceFMnode_3755 H_C1R3_99940_KNKSF2_Tubulysin 70
lettuceFMnode_17813 G_C4R5_80184_KNKS113774F_Tetronasin 79.57
lettuceFMnode_69469 J_C11R7_99276_KNKSF2_Nystatin 87.27
I want to edit the names in the entire 2nd column so that only the last part will stay (meaning delete anything before that, so in fact leaving what comes after the last _).
I looked into different solutions using a combination of cut and sed, but couldn't understand how the code should be built.
Would appreciate any tips and help!
Thank you!
Here's one way:
perl -pe 's/^\S+\s+\K\S+_//'
For every line of input (-p) we execute some code (-e ...).
The code performs a substitution (s/PATTERN/REPLACEMENT/).
The pattern matches as follows:
^ beginning of string
\S+ 1 or more non-whitespace characters (the first column)
\s+ 1 or more whitespace characters (the space after the first column)
\K do not treat the text matched so far as part of the final match
\S+ 1 or more non-whitespace characters (the second column)
_ an underscore
Because + is greedy (it matches as many characters as possible), \S+_ will match everything up to the last _ in the second column.
Because we used \K, only the rest of the pattern (i.e. the part of the match that lies in the second column) gets replaced.
The replacement string is empty, so the match is effectively removed.
With sed:
sed 's/ [^ ]*_/ /' file
Replace the first space followed by non-space characters ([^ ]*) followed by _ with one space.
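The same idea can be sketched in Python for comparison. Python's re module has no \K, so this version captures the first column and its trailing whitespace in a group and puts it back. The function name keep_last_part is invented for the example.

```python
import re

def keep_last_part(line):
    """Strip everything up to the last "_" in the second column."""
    # Group 1 is the first column plus the whitespace after it; the greedy
    # \S+_ then consumes the second column up to its last underscore.
    return re.sub(r"^(\S+\s+)\S+_", r"\1", line)

rows = [
    "lettuceFMnode_1240 J_C7R5_99354_KNKSR3_Oligomycin 81.52",
    "lettuceFMnode_3755 H_C1R3_99940_KNKSF2_Tubulysin 70",
]

for row in rows:
    print(keep_last_part(row))
```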

How do I put several characters after the first letter and the last letter by use of Vim?

How do I put several characters after the first letter and the last letter in the whole text by use of Vim?
E.g. I need to put {{c1:: after the first letter and }} after the last letter. Also, I want to ignore two-letter words.
You mean in every word? Try this:
:%s/\<\(\w\)\(\w\w\+\)\>/\1{{c1::\2}}/g
That will replace every first character in a word with the first character followed by {{c1:: and add }} at the end of it. Words shorter than three characters are ignored.
If your words contain more than just [a-zA-Z0-9], then replace \w by a more appropriate character class.
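For readers who want to test the pattern outside Vim, here is a rough Python equivalent. It is a sketch only: \b replaces Vim's \< and \>, and Python's \w is broader than [a-zA-Z0-9_] because it also matches non-ASCII letters. The function name add_cloze is made up for the example.

```python
import re

def add_cloze(text):
    """Wrap everything after the first letter of each 3+ letter word."""
    # Group 1 is the first character, group 2 the remaining two or more.
    return re.sub(r"\b(\w)(\w\w+)\b", r"\1{{c1::\2}}", text)

print(add_cloze("Vim is great"))
print(add_cloze("to be or not"))
```

Two-letter words such as "is" and "to" are left untouched because the second group needs at least two characters.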

Vim: word vs WORD

I'm learning Vim and can't wrap my head around the difference between word and WORD.
I got the following from the Vim manual.
A word consists of a sequence of letters, digits and underscores, or a
sequence of other non-blank characters, separated with white space
(spaces, tabs, <EOL>). This can be changed with the 'iskeyword'
option. An empty line is also considered to be a word.
A WORD consists of a sequence of non-blank characters, separated with
white space. An empty line is also considered to be a WORD.
I feel word and WORD are just the same thing. They are both a sequence of non-blank chars separated with white spaces. An empty line can be considered as both word and WORD.
Question:
What's the difference between them?
And why/when would someone use WORD over word?
I've already done Google and SO search, but their search-engine interpret WORD as just word so it's like I'm searching for Vim word vs word and of course won't find anything useful.
A WORD is always delimited by whitespace.
A word is delimited by non-keyword characters, which are configurable. Whitespace characters aren't keyword characters, and usually other characters (like ()[],-) aren't either. Therefore, a word is usually smaller than a WORD; word navigation is more fine-grained.
Example
This "stuff" is not-so difficult!
wwww  wwwww  ww www ww wwwwwwwww    " (key)words, delimiters are non-keywords: "-! and whitespace
WWWW WWWWWWW WW WWWWWW WWWWWWWWWW   " WORDS, delimiters are whitespace only
To supplement the previous answers... I visualise it like this; WORD is bigger than word, it encompasses more...
If I do viw ("select inner word") while my cursor is on app in the following line, it selects app:
app/views/layouts/admin.blade.php
If I do viW (WORD) while my cursor is at the same place, it selects the whole sequence of characters. A WORD includes characters that words, which are like English words, do not, such as asterisks, slashes, parentheses, brackets, etc.
According to Vim documentation ( :h 03.1 )
A word ends at a non-word character, such as a ".", "-" or ")".
A WORD ends strictly with a white-space. This may not be a word in normal sense, hence the uppercase.
eg.
ge b w e
<- <- ---> --->
This is-a line, with special/separated/words (and some more). ~
<----- <----- --------------------> ----->
gE B W E
If your cursor is at m (of more above)
a word would mean 'more' (i.e delimited by ')' non-word character)
whereas a WORD would mean 'more).' (i.e. delimited by white-space only)
similarly, If your cursor is at p (of special)
a word would mean 'special'
whereas a WORD would mean 'special/separated/words'
This is a grammar problem in how the definition of "word" is read.
At first I got stuck on the Chinese version of this definition (possibly a mistranslation).
The definition is correct, but it should be read like this:
A word consists of:
[(a sequence of letters,digits and underscores),
or (a sequence of other non-blank characters)],
separated with white space (spaces, tabs, <EOL>).
Whitespace is only needed to delimit two adjacent 'words' of the same type.
More examples, in brackets, follow:
(example^&$%^Example) is three words: (example), (^&$%^) and (Example)
(^&^&^^ &&^&^) is two words: (^&^&^^) and (&&^&^)
(we're in stackoverflow) is five words: (we), ('), (re), (in) and (stackoverflow)
Another way to say it: if you're coding and want to move through the line, stopping at delimiters and things like "() . [] , :", use w.
If you want to bypass those and just jump between words as in a novel or short story, use W.
For coding, the small w is probably the one used most often. It depends on where you are in the code.
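The two definitions can be approximated with a pair of regular expressions. The sketch below is Python, not Vim script, and it assumes that \w is a fair stand-in for the default 'iskeyword' setting (letters, digits and underscore), which may not hold for every configuration. The function names are invented for the example.

```python
import re

def split_words(text):
    """Approximate Vim 'word's: keyword runs, or runs of other non-blanks."""
    return re.findall(r"\w+|[^\w\s]+", text)

def split_big_words(text):
    """Approximate Vim 'WORD's: runs of non-blank characters."""
    return re.findall(r"\S+", text)

samples = [
    "we're in stackoverflow",
    "example^&$%^Example",
    "app/views/layouts/admin.blade.php",
]

for sample in samples:
    print(split_words(sample))
    print(split_big_words(sample))
```

Every WORD is made up of one or more words, which is why w stops more often than W on the same line.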
