How to change the strings in a specific column using bash/shell? - text

I have a .txt file that looks like this (about 400 rows in total):
lettuceFMnode_1240 J_C7R5_99354_KNKSR3_Oligomycin 81.52
lettuceFMnode_3755 H_C1R3_99940_KNKSF2_Tubulysin 70
lettuceFMnode_17813 G_C4R5_80184_KNKS113774F_Tetronasin 79.57
lettuceFMnode_69469 J_C11R7_99276_KNKSF2_Nystatin 87.27
I want to edit the names in the entire 2nd column so that only the last part stays, i.e. delete everything up to and including the last _ and keep only what comes after it.
I looked into different solutions using a combination of cut and sed, but couldn't work out how the command should be built.
Would appreciate any tips and help!
Thank you!

Here's one way:
perl -pe 's/^\S+\s+\K\S+_//'
For every line of input (-p) we execute some code (-e ...).
The code performs a substitution (s/PATTERN/REPLACEMENT/).
The pattern matches as follows:
^ beginning of string
\S+ 1 or more non-whitespace characters (the first column)
\s+ 1 or more whitespace characters (the space after the first column)
\K do not treat the text matched so far as part of the final match
\S+ 1 or more non-whitespace characters (the second column)
_ an underscore
Because + is greedy (it matches as many characters as possible), \S+_ will match everything up to the last _ in the second column.
Because we used \K, only the rest of the pattern (i.e. the part of the match that lies in the second column) gets replaced.
The replacement string is empty, so the match is effectively removed.

With sed:
sed 's/ [^ ]*_/ /' file
Replace the first space followed by non-space characters ([^ ]*) followed by _ with one space.
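A small sketch of the sed version on the other two rows from the question, assuming the columns are separated by single spaces as in the sample (the variable names are illustrative):

```shell
# Two more sample rows from the question, separated by single spaces.
rows_sed='lettuceFMnode_17813 G_C4R5_80184_KNKS113774F_Tetronasin 79.57
lettuceFMnode_69469 J_C11R7_99276_KNKSF2_Nystatin 87.27'

# Read from stdin here instead of a file, otherwise the same command.
sed_out=$(printf '%s\n' "$rows_sed" | sed 's/ [^ ]*_/ /')

printf '%s\n' "$sed_out"
# lettuceFMnode_17813 Tetronasin 79.57
# lettuceFMnode_69469 Nystatin 87.27
```

Note that this pattern is not anchored to the second column: on a row whose second column contains no underscore, it would act on a later column that does.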

Related

What do you understand by this RegEx?

I'm working with VBA and trying to split a string into three columns. Almost all strings look like Company Name 3567782 Agent Name.pdf
With this pattern I want to match all the text before a space and digits (1st group), the digits (2nd group) and all the text after the space and before the .pdf (3rd group).
strPattern = "^(.+)\n(\d{4,10})\n(.+).pdf"
I recall that whitespace in Python is \s, but I saw that in VBA it is \n.
Can you help me find the right pattern for what I'm looking for?
As I put in my comment, I use the https://regex101.com site. There are others but I find this one the most helpful to me.
When I put in your regex
^(.+)\n(\d{4,10})\n(.+).pdf
and test string
Company Name 3567782 Agent Name.pdf
the first thing I notice is that the regex does not match the test string (see right side under MATCH INFORMATION).
Here are a couple things that I saw:
\n is newline, not space. In regex, space is " ".
Your last "." in ".pdf" is not registering as a literal period, it's a token that matches any character. To match a literal period, you need \.
If we change those two things it returns three groups that seem to match what you are looking for.
^(.+) (\d{4,10}) (.+)\.pdf
It looks like for the digits, you are looking for between 4 and 10 digits. If that's correct, it looks like your regex is good. You could put in a handful of example strings into the TEST STRING area and make sure that it works in all cases.
I'd use either of these:
(?:(?:([a-zA-Z]+\.?)|(\d+)))
This captures a run of letters (a-z, A-Z) greedily, with an optional trailing . to allow for the .pdf, or captures a run of digits.
This version excludes the spaces ([ ] or \s).
Or keep the search structured so you can control what goes into each column:
^(\w+\s\w+)|(\d+)|(\w+\s\w+\.\w+$)
\b or ^ - word boundary or start of string
(\w+\s\w+) - 1st capture: \w+ matches alphanumeric characters greedily, followed by one whitespace character (use \s* or \s+ for more), followed again by alphanumeric characters greedily
|(\d+) - alternation: \d+ captures just the digits
|(\w+\s\w+\.\w+$) - similar to the 1st group, but allows for the '.' of .pdf and is anchored to the end of the string ($).
You could optionally build the '.' into the 1st group as in my first pattern, but for neatness and better control I prefer the 2nd.

unix command to replace anything between two delimiter positions

Please help me with a unix command to replace anything between two delimiter positions.
For example: I have multiple files with the header data below, and I want to replace the data between the 9th and 10th * delimiters.
ISA*00* *00* *ZZ*80881 *ZZ*TNC0022 *190115*1237*^*00501*000320089*0*P*|~
My output should look like this:
ISA*00* *00* *ZZ*80881 *ZZ*TNC0022 *190327*1237*^*00501*000320089*0*P*|~
Try this:
perl -pe 's/^((?:[^*]*\*){9})([^*]+)(.*)/${1}190327$3/'
The regexp searches for 9 occurrences ({9}) of a run of non-star characters ([^*]*) followed by a star (\*) and stores them all in the first capture group. The second capture group is one or more non-star characters ([^*]+). The third capture group is the rest of the line.
A matching line gets replaced by the first part ${1}, your new value 190327 and the third part $3.

VIM line count in status bar with thousands separator?

Is it possible to display the line count in the VIM status bar with thousands separators, preferably custom thousands separators?
Example:
set statusline=%L
should lead to "1,234,567" instead of "1234567".
I've found a way but it looks a bit crazy:
set statusline=%{substitute(line('$')\,'\\d\\zs\\ze\\%(\\d\\d\\d\\)\\+$'\,'\,'\,'g')}
The first round of backslashes is just for set (I have to escape , and \ itself).
What I'm actually setting the option to is this string:
%{substitute(line('$'),'\d\zs\ze\%(\d\d\d\)\+$',',','g')}
As a format string, this line contains one formatting code, which is %{...}. Everything in ... is evaluated as an expression and the result substituted back in.
The expression I'm evaluating is (spaces added (if I had added them to the real code, I would've had to escape them for set again, forcing yet more backslashes)):
substitute(line('$'), '\d\zs\ze\%(\d\d\d\)\+$', ',', 'g')
This is a call to the substitute function. The arguments are the source string, the regex, the replacement string, and a list of flags.
The string we're starting with is line('$'). This call returns the number of lines in the current buffer (or rather the number of the last line in the buffer). This is what %L normally shows.
The search pattern we're looking for is \d(\d\d\d)+$ (special vim craziness removed), i.e. a digit followed by 1 or more groups of 3 digits, followed by the end of the string. Grouping is spelled \%( \) in vim, and "1 or more" is \+, which gives us \d\%(\d\d\d\)\+$. The last bit of magic is \zs\ze. \zs sets the start of the matched string; \ze sets the end. This works as if everything before \zs were a look-behind pattern and everything after \ze were a look-ahead pattern.
What this amounts to is: We're looking for every position in the source string that is preceded by a digit and followed by exactly N digits (where N is a multiple of 3). This works like starting at the right and going left, skipping 3 digits each time. These are the positions where we need to insert a comma.
That's what the replacement string is: ',' (a comma). Because we're matching a string of length 0, we're effectively inserting into the source string (by replacing '' with ',').
Finally, the g flag says to do this with all matches, not just the first one.
TL;DR:
line('$') gives us the number of lines
substitute(..., '\d\zs\ze\%(\d\d\d\)\+$', ',', 'g') adds commas where we want them
%{ } lets us embed arbitrary expressions into statusline

Get lines that end with "$" in a text file

I have an output like this:
a/foo bar /
b/c/foo sth /xyz
cc/bar ghj /axz/byz
What I want is just this line:
a/foo bar /
To be clearer, I want the lines that end with a specific string: I want to grep the lines that have a / character as their last column.
You can use $ like this:
$ grep '/$' file
a/foo bar /
As $ stands for end of line, /$ matches those lines whose last character is a /.
grep '/$'
The slash is not a special character for grep, and $ anchors the expression to the end of a line.
You can even grep for lines whose last column is just a slash (as long as it is not the only column in the line).
I assumed that the last column of a line is a string preceded by one or more whitespace characters and followed by nothing else. This assumption breaks down when a line has only one column, because that column has no whitespace in front of it to mark it as the last one.
By enabling Perl-compatible regular expressions (-P):
grep -P '\s+/$'
\s matches any whitespace character (space, tab, newline)
+ matches the preceding element 1 or more times
$ matches the end of the line
OR refer to Character Classes and Bracket Expressions
grep '[[:space:]]\+/$'
OR
grep '[[:blank:]]\+/$'
‘[:blank:]’ Blank characters: space and tab.
‘[:space:]’ Space characters: in the ‘C’ locale, this is tab, newline,
vertical tab, form feed, carriage return, and space. It is a synonym for '\s'.
As @fedorqui pointed out, the backslash after ]] distinguishes the "one or more" quantifier \+ from
a literal +. Thanks for the explanation.
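The difference between "ends with a slash" and "last column is a lone slash" shows up on a small sample. This sketch reuses the lines from the question plus one hypothetical single-column line (only/), and uses grep -E, where + needs no backslash, instead of the \+ spelling above:

```shell
# Lines from the question, plus a hypothetical one-column line "only/".
lines='a/foo bar /
b/c/foo sth /xyz
cc/bar ghj /axz/byz
only/'

# Any line whose last character is a slash.
any_slash=$(printf '%s\n' "$lines" | grep '/$')

# Only lines where the final slash is preceded by at least one blank.
last_col=$(printf '%s\n' "$lines" | grep -E '[[:blank:]]+/$')

printf 'any_slash:\n%s\nlast_col:\n%s\n' "$any_slash" "$last_col"
# any_slash:
# a/foo bar /
# only/
# last_col:
# a/foo bar /
```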
Apologies if the Perl-style answer is off, since I have never really used or learned Perl expressions, but I hope it helps you find the slash in the last column. You may want to read these for more on matching whitespace followed by a slash at the end of a line:
grep with regexp: whitespace doesn't match unless I add an assertion
Regular expressions in Perl

Vim Multi-line search pattern

I have a substitute command that captures and displays submatch() values in the replacement string. But I have another line of information that I want to parse below this line. That line is always the first line after an empty line, though the number of lines TO that empty line varies. For example:
The first important line I want to capture is here
Stuff I don't want.
A few more lines of stuff I don't want...
Second line I want to capture.
This pattern repeats a hundred or so times in a document. I can substitute "The First Important Line" fine, but shouldn't that search pattern include a way to jump down to the first empty line and then pick up the next "Second line I want to capture."? I could then place the contents of that second line into submatch parentheses and substitute them where needed (right?).
If so, I cannot work out how to extend the first search pattern to capture the "Second line". Suggestions or corrections to my approach would be greatly appreciated.
Someone has already dealt with a similar problem. Below I provide their solution and the detailed description.
/^\nF\d\_.\{-}\_^\n\zs.*/+
It means "Find a block of lines that start with F and a digit,
then scan forward to the next blank line and select the line after that."
Part of regex - Meaning
^\n - Matches the start of a line followed by a newline, i.e. a blank line.
F\d - The next line starts with an F followed by a digit.
\_.\{-} - \_. is like ., but also matches a newline. \{-} matches as few of the preceding \_. as possible. (If I were to use * instead of \{-}, it would match to near the end of the file.)
\_^\n - Matches a blank line. \_^ is like ^, but ^ only works at the start of a regular expression.
\zs - When the match is finished, set the start of the match to this point. I use this because I don't want the preceding text to be highlighted.
.* - Matches the whole line.
The + after the regular expression tells Vim to put the cursor on the line after the selection.
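If the goal is only to pull those second lines out of the file, the same idea can be sketched outside Vim with a small awk state machine. This is a simplified alternative and not the Vim pattern above: it assumes, as that pattern does, that each block starts with a line beginning with F and a digit, and the sample text is hypothetical.

```shell
# Hypothetical sample: each block starts with a header line "F<digit>...".
doc='
F1 first important line
stuff I do not want
more stuff I do not want

second line for F1

F2 another important line
junk

second line for F2'

second_lines=$(printf '%s\n' "$doc" | awk '
  want      { print; want = 0; next }   # the line right after the blank line
  /^F[0-9]/ { in_block = 1; next }      # a header line opens a block
  in_block && $0 == "" { want = 1; in_block = 0 }  # first blank line closes it
')

printf '%s\n' "$second_lines"
# second line for F1
# second line for F2
```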
I think I read about offsets, but I can't find the bit in the help that is relevant right now. As such, my other solution would be to record a macro to do what you want:
qa/[Your pattern]<CR>jddq
You could then execute this macro with @a and repeat it with @@, or run it many times (e.g., 999@a).
