Replace multiple occurrences on the same line with sed - linux

I want to replace all slashes "/" between alphanumeric characters with backslash+slash "\/", apart from the last one in each string, e.g.
nocareNocare abc\/def/ghi/mno\/pq/r abc\/def\/ghi/mno\/pq/r
should become:
nocareNocare abc\/def\/ghi\/mno\/pq/r abc\/def\/ghi\/mno\/pq/r
I use:
sed 's/\(.*\)\([[:alnum:]]\)\/\([[:alnum:]]\)\(\S*\)\(\\\|\/\)/\1\2\\\/\3\4\//g'
Short explanation: match
any string + alnum + / + alnum + any non-whitespace + / or \
But it only replaces one occurrence per run, so I need to run it 3 times to replace all 3 occurrences. It looks like the first time it matches all the way to:
>nocareNocare abc\/def/ghi/mno\/pq/r abc\/def\/ghi/
instead of
>nocareNocare abc\/def/

sed -e :a -e 's|\([a-z0-9]\)/\([a-z0-9][^ ]*[a-z0-9]/[a-z0-9]\)|\1\\/\2|;ta' filename
Loosely translated, this says "replace a lone slash followed by some other stuff in the string, followed by another lone slash, with backslash-slash and that same stuff (and the second slash). And after making such a replacement, start over again."
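As a quick check, the loop can be run against the sample line from the question. This is a minimal sketch that assumes GNU sed and pipes the line in with printf instead of reading a file; the variable names are only for illustration.

```shell
# Sample line from the question (single quotes keep the backslashes literal).
input='nocareNocare abc\/def/ghi/mno\/pq/r abc\/def\/ghi/mno\/pq/r'

# :a defines a label; ta jumps back to it whenever the s command replaced something,
# so one unescaped slash is escaped per pass until no slash with a later slash remains.
result=$(printf '%s\n' "$input" |
  sed -e :a -e 's|\([a-z0-9]\)/\([a-z0-9][^ ]*[a-z0-9]/[a-z0-9]\)|\1\\/\2|;ta')

printf '%s\n' "$result"
```

Note that [a-z0-9] does not match uppercase letters; if the real data contains them, [[:alnum:]] from the question's own attempt is probably the safer class.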

You can use a Perl command-line solution based on the following regex constructs:
(?<!\\)
not preceded by a backslash
(?!\w+\s)
not followed by word characters terminating in whitespace
perl -pe 's;(?<!\\)/(?!\w+\s);\\/;g' file
nocareNocare abc\/def\/ghi\/mno\/pq/r abc\/def\/ghi\/mno\/pq/r

With GNU sed:
sed -E 's:([^\])/:\1\\/:g;s:\\/([^\]*( |$)):/\1:g' file
There are two s commands here:
s:([^\])/:\1\\/:g replace all / not preceded by a \ with \/
s:\\/([^\]*( |$)):/\1:g replace last \/ before space or end of line with /
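To see what each of the two s commands contributes, the sketch below runs the first one alone and then both together on the question's sample line. It assumes GNU sed; the variable names are illustrative.

```shell
input='nocareNocare abc\/def/ghi/mno\/pq/r abc\/def\/ghi/mno\/pq/r'

# First command alone: every / not preceded by a backslash is escaped,
# including the last slash of each word.
pass1=$(printf '%s\n' "$input" | sed -E 's:([^\])/:\1\\/:g')

# Both commands: the second one turns the final \/ before a space
# or the end of the line back into a plain /.
result=$(printf '%s\n' "$input" | sed -E 's:([^\])/:\1\\/:g;s:\\/([^\]*( |$)):/\1:g')

printf '%s\n' "$pass1" "$result"
```

The first printed line still has \/ in front of each final r; the second line is the desired output.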

Related

Linux sed regular expression

I have a string:
2021-05-27 10:40:50.678117 PID529270:TID 47545543550720:SID 1673488:TXID 786092740:QID 140: INFO:MEMCONTEXT:MemContext state: mem[cur/hi/max] = 9135 / 96586 / 96576 MB, VM[cur/hi/max] = 9161 / 21841178 / 100663296 MB
I want to get the number 9135, the first occurrence between '=' and '/'. My current command, shown below, works, but I don't think it's ideal:
sed -r 's/.* = ([0-9]+) .* = .*/\1 /'
I need a neater one; please advise.
You can use
sed -En 's~.*= ([0-9]+) /.*=.*~\1~p'
An awk solution:
awk -F= '{gsub(/\/.*|[^0-9]/,"",$2);print $2}'
Details:
-En - E (or r as in your example) enables the POSIX ERE syntax and n suppresses the default line output
.*= ([0-9]+) /.*=.* - matches any text, = + space, captures one or more digits into Group 1, then matches a space, /, then any text, = and again any text
\1 - replaces with Group 1 value
p - prints the result of the substitution.
Here, ~ are used as regex delimiters in order not to escape / in the pattern.
awk:
-F= - sets the input field separator to =
gsub(/\/.*|[^0-9]/,"",$2) - in Field 2, removes the first / together with the rest of the field, and any other non-digit char
print $2 - prints the modified Field 2 value.
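Both commands can be tried on the log line from the question. The sketch below assumes GNU sed (for -E) and any POSIX awk; the variable names are illustrative.

```shell
line='2021-05-27 10:40:50.678117 PID529270:TID 47545543550720:SID 1673488:TXID 786092740:QID 140: INFO:MEMCONTEXT:MemContext state: mem[cur/hi/max] = 9135 / 96586 / 96576 MB, VM[cur/hi/max] = 9161 / 21841178 / 100663296 MB'

# sed: the trailing .*=.* forces the capture to come from the first "= digits /"
# group, because another = must still follow it.
from_sed=$(printf '%s\n' "$line" | sed -En 's~.*= ([0-9]+) /.*=.*~\1~p')

# awk: Field 2 is the text between the first and second =; gsub trims it to digits.
from_awk=$(printf '%s\n' "$line" | awk -F= '{gsub(/\/.*|[^0-9]/,"",$2);print $2}')

printf '%s\n' "$from_sed" "$from_awk"
```

Both lines of output should read 9135.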
You could also get the first match with grep using -P for Perl-compatible regular expressions.
grep -oP "^.*? = \K\d+(?= /)"
^ Start of string
.*? Match as few chars as possible
= Match a space, = and a space
\K\d+ Forget what is matched so far, then match 1+ digits
(?= /) Assert a space and / to the right
Output
9135
Since you want the material between the first = and the first /, ignoring the spaces, you could use:
sed -E -e 's%^[^=]*= ([^/]*) /.*$%\1%'
This uses Extended Regular Expressions (ERE) (-E; -r also works with GNU sed), and searches from the start of the line for a sequence of 'not =' characters, the = character, a space, anything that's not a slash (which is remembered), another space, a slash, and anything that follows, replacing it all with what was remembered. The ^ and $ anchors aren't crucial; it will work the same without them. The % symbols are used instead of / because the searched-for pattern includes a /. If you're sure there'll never be any spaces other than the first and last ones between the = and /, you can use [^ /]* in place of [^/]* and there should be some small (probably immeasurable) performance benefit.
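A short sketch comparing the [^/]* and [^ /]* forms on a shortened copy of the log line (the shortening is an assumption made for readability; the part from mem[...] onward is taken from the question):

```shell
line='MemContext state: mem[cur/hi/max] = 9135 / 96586 / 96576 MB, VM[cur/hi/max] = 9161 / 21841178 / 100663296 MB'

# Capture everything that is not a slash between "= " and " /".
with_slash_class=$(printf '%s\n' "$line" | sed -E -e 's%^[^=]*= ([^/]*) /.*$%\1%')

# Same idea, but the capture may contain neither spaces nor slashes.
with_space_class=$(printf '%s\n' "$line" | sed -E -e 's%^[^=]*= ([^ /]*) /.*$%\1%')

printf '%s\n' "$with_slash_class" "$with_space_class"
```

Because [^=]* stops at the first =, the slashes inside mem[cur/hi/max] do not get in the way, and both variants should print 9135.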

Sed - remove all semicolons between a pair of double-quotes

I have a dirty csv-file containing rows with quoted semicolons. I am trying to clear these semicolons with commands like:
sed -rin 's/(^.*\;.*\;\".*)(\;)(.*\"\;.*$)/\1\3/' file
But somehow this doesn't remove all of the semicolons. Some of the problematic rows look like this:
;0;"One ▒;)";123; ... ; nth-1column;
;0;"Two ▒;)";456; ... ; nthcolumn;
When they should be cleaned to:
;0;"One ▒)";123; ... ; nth-1column;
;0;"Two ▒)";456; ... ; nthcolumn;
There might be some encoding issues, but this should be ignored by the regex. I am only interested in removing the semicolons, the encoding is handled afterwards.
Any ideas on how to aggressively clean all semicolons contained within double-quotes?
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^;"]*"[^"]*)*"[^";]*);/\1/;ta' file
The back reference starts at the front of the line and captures any characters outside double quotes, then zero or more complete quoted strings that contain no semicolons (each followed by further unquoted characters), then an opening double quote and characters that are neither double quotes nor semicolons. If the next character is a semicolon, it is removed, and the substitution is repeated until it fails; the result is then printed.
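The loop can be tried on a couple of small rows. This sketch assumes GNU sed; the rows are simplified, ASCII-only stand-ins for the question's data (hypothetical values), with the second row added to show several semicolons and two quoted fields.

```shell
# Hypothetical test rows: semicolons inside quotes should go, the rest should stay.
rows=';0;"One ;)";123;x;
a;"b;c;d";e;"f;g";h'

# Each pass removes the first semicolon found inside a quoted field;
# ta repeats the substitution until nothing is left to remove.
cleaned=$(printf '%s\n' "$rows" |
  sed -E ':a;s/^([^"]*("[^;"]*"[^"]*)*"[^";]*);/\1/;ta')

printf '%s\n' "$cleaned"
```

The field-separating semicolons outside the quotes are untouched, so the column count of each row stays the same.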
An alternative:
sed -E '/^([^"]*("[^";]*"[^"]*)*"[^";]*);/{s//\n\1/;D}' file
or:
sed -E 's/^([^"]*("[^";]*"[^"]*)*"[^";]*);/\n\1/;T;D' file
EDIT:
sed -nE '/^([^"]*("[^";]*"[^"]*)*"[^";]*);/{:a;s//\1/;ta;p}' file
You can use
sed ':a;s/^\(\([^"]*;\?\|"[^";]*";\?\)*"[^";]*\);/\1/;ta' file
It works like this:
:a - sets a label
^\(\([^"]*;\?\|"[^";]*";\?\)*"[^";]*\); - find:
^ - start of string
\(\([^"]*;\?\|"[^";]*";\?\)*"[^";]*\) - Group 1:
\([^"]*;\?\|"[^";]*";\?\)* - zero or more occurrences of
[^"]*;\? - zero or more chars other than " and then an optional ;
\| - or
"[^";]*";\? - ", then zero or more chars other than " and ; and then a " and then an optional ;
" - a " char
[^";]* - zero or more chars other than a ; and "
; - a semi-colon
\1 - replace with Group 1 value
ta - if there was a substitution, go back to a label position.

How to remove the first three characters from the fasta file header

I have a fasta file like this:
>rna-XM_00001.1
actact
>rna-XM_00002.1
atcatc
How do I remove the 'rna-' so it becomes:
>XM_00001.1
actact
>XM_00002.1
atcatc
What you're showing is the file contents? Then sed should be able to do this:
sed 's/^>rna-/>/' < inputfile > outputfile
Explanation:
The first character of the command-line to sed is s, which tells sed to do substitution
The / are delimiters
The ^ tells sed to look only at the start of a line
The next >rna- is the pattern to match at the start of a line
The next > is the replacement substituted for the pattern
If, instead, you want to always remove the first four characters after a > as long as they end in -, you could use:
sed 's/^>...-/>/' < inputfile > outputfile
Explanation:
This is similar to above, except the pattern to match at the start of a line is >...-. The pattern is a regexp, where a . matches any single character. So this pattern matches any line starting with >, followed by any three characters, followed by -.
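Both substitutions can be checked on the four sample lines without creating files. A minimal sketch, with the fasta text held in a shell variable purely for illustration:

```shell
fasta='>rna-XM_00001.1
actact
>rna-XM_00002.1
atcatc'

# Literal prefix: only headers that start with ">rna-" are changed.
literal=$(printf '%s\n' "$fasta" | sed 's/^>rna-/>/')

# Generic prefix: any three characters followed by "-" right after ">".
generic=$(printf '%s\n' "$fasta" | sed 's/^>...-/>/')

printf '%s\n' "$literal"
```

Sequence lines such as actact do not start with >, so neither command touches them. On this input both variants give the same result.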

Two pattern match on same sed command

I have the following sed command:
sed -n '/^out(/{n;p}' ${filename} | sed -n '/format/ s/.*format=//g; s/),$//gp; s/))$//gp'
I tried to do it as one line as in:
sed -n '/^out(/{n;}; /format/ s/.*format=//g; s/),$//gp; s/))$//gp' ${filename}
But that also displays the lines I don't want (those that do not match).
What I have is a file with some strings as in:
entry(variable=value)),
format(variable=value)),
entry(variable=value)))
out(variable=value)),
format(variable=value)),
...
I just want the format lines that come right after an out entry, with the trailing )) or ), removed.
You can use this sed command:
sed -nr '/^out[(]/ {n ; s/.*[(]([^)]+)[)].*/\1/p}' your_file
Once an out line is found, sed advances to the next line (n) and uses the s command with the p flag to extract only what is inside the parentheses.
Explanation:
I used [(] instead of \(. Outside brackets a ( usually means grouping; if you want a literal (, you need to escape it as \( or put it inside brackets. Most RE special characters don't need escaping when put inside brackets.
([^)]+) means a group (the "(" here are RE metacharacters, not literal parentheses) that consists of one or more (+) characters that are not (^) ) (a literal closing parenthesis); the ^ inverts the character class [ ... ]
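A small sketch of the command on data shaped like the question's file. It assumes GNU sed (for -r), and the values first=1 and second=2 are hypothetical, chosen only so the two format lines can be told apart.

```shell
data='entry(variable=value)),
format(first=1)),
entry(variable=value)))
out(variable=value)),
format(second=2)),'

# On a line starting with "out(", n loads the next line; the s command then keeps
# only the text between the parentheses and p prints it (-n suppresses everything else).
result=$(printf '%s\n' "$data" | sed -nr '/^out[(]/ {n ; s/.*[(]([^)]+)[)].*/\1/p}')

printf '%s\n' "$result"
```

Only the format line directly after the out entry is printed; the earlier format line is skipped because it does not follow an out line.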

Using sed to match anything and \s

I've got the following:
sed -i "s/SYNFLOOD_RATE = \"100/s\"/SYNFLOOD_RATE = \"10\s\"/g"
Question is how do I avoid this error?
/bin/sed: -e expression #1, char 28: unknown option to `s'
And is there a way to do a wild card match and replace with sed?
You have too many slashes: 4 when there should be 3. Use a different delimiter; comma (,), bang (!), hash (#), and at (@) are common alternatives.
sed -i "s,SYNFLOOD_RATE = \"100/s\",SYNFLOOD_RATE = \"10\s\",g"
Note that you have "100/s" in the original and "10s" (no slash) in the replacement. To actually insert a backslash, you'd need to enter 4 of them: 10\\\\s. Each pair will get reduced to a single by the shell and then the remaining double will be interpreted as a literal backslash by sed.
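The delimiter change and the four-backslash rule can be checked on a single line. This is a sketch assuming GNU sed, run without -i on piped input; the last command is a possible answer to the wildcard part of the question and is not taken from the answer above.

```shell
conf='SYNFLOOD_RATE = "100/s"'

# Comma as the delimiter, so the / in the pattern needs no escaping.
plain=$(printf '%s\n' "$conf" | sed "s,SYNFLOOD_RATE = \"100/s\",SYNFLOOD_RATE = \"10/s\",g")

# Four backslashes inside double quotes: the shell hands two to sed, and sed writes one.
with_backslash=$(printf '%s\n' "$conf" | sed "s,SYNFLOOD_RATE = \"100/s\",SYNFLOOD_RATE = \"10\\\\s\",g")

# Wildcard sketch (an assumption, not from the answer): .* matches whatever is quoted.
wildcard=$(printf '%s\n' "$conf" | sed 's,^SYNFLOOD_RATE = ".*",SYNFLOOD_RATE = "10/s",')

printf '%s\n' "$plain" "$with_backslash" "$wildcard"
```

Using single quotes around the sed script, as in the last command, avoids the shell's backslash handling altogether, so only sed's own escaping rules apply.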
If you want to first select the matching lines (as grep would) and then substitute:
sed -i '/SYNFLOOD_RATE = \"100/s/"\/SYNFLOOD_RATE = \"10\s\"/replacement/g'
The delimiter can be any character other than /, for example:
sed -i '/SYNFLOOD_RATE = "100/s#"/SYNFLOOD_RATE = "10\s"#replacement#g'
(the delimiter here is #)

Resources