Search for a line containing characters, hyphen and numbers following hyphen with sed - search

Let's suppose a file with following content named file.txt is used as a source.
12345
ABC-12
XCD-13
BCD-12345
Some text on this line
*+-%&
What I am looking for with sed would be to get only ABC-12 line and remove hyphen so the desired output would be ABC12
So far I have been able to get numbers only with following command:
sed "s/[^0-9]//g" file.txt
Output
12345
12
13
12345
I got the closest with grep with command:
grep -oP 'ABC-\b\d{2}\b' file.txt
Output
ABC-12
How should this be constructed with sed and also hyphen removed from the output?
Note also that numbers after ABC can be considered as changing like a variable so the idea would be to search for "ABC and numbers following it after hyphen" instead of searching for "ABC-12" directly.

idea would be to search for ABC and numbers following it after hyphen
You may use this sed for this:
sed '/^ABC-[0-9][0-9]$/!d; s/-//' file
ABC12
Here:
/^ABC-[0-9][0-9]$/!d: Searches input for a line that starts with ABC followed by a hyphen followed by 2 digits and end. All non-matching lines are removed due to !d as command.
s/-//p: Removes - from the match
Update:
As per comment below if ABC- text is not at the start then use:
sed -nE 's/.*(ABC)-([0-9][0-9]).*/\1\2/p' file

You can use
sed -En '/^.*(ABC)-([0-9]+).*/{s//\1\2/p;q}'
Details:
-En - E enables POSIX ERE and n suppresses default line output
^.*(ABC)-([0-9]+).* - start of string (here, line), then any text, then ABC captured into Group 1, -, then one or more digits (captured into Group 2) and then the rest of the line
{s//\1\2/p;q} - if there was a match
s//\1\2/p - take the same pattern and replace with Group 1 + Group 2 text, and print the result of the substitution
q - quit (you only need the first match here).
See the online demo:
#!/bin/bash
s='ABC-12
XCD-13
BCD-12345
Some text on this line
*+-%&'
sed -En '/^.*(ABC)-([0-9]+).*/{s//\1\2/p;q}' <<< "$s"
Output:
ABC12

Related

How to extract a specific text from gz file?

I need to extract the 5 to 11 characters from my fastq.gz data this data is just too large for running in R. So I was wondering if I can do it directly in Linux command line?
The fastq file looks like this:
#NB501399:67:HFKTCBGX5:1:11101:13202:1044 1:N:0:CTTGTA
GAGGTNACGGAGTGGGTGTGTGCAGGGCCTGGTGGGAATGGGGAGACCCGTGGACAGAGCTTGTTAGAGTGTCCTAGAGCCAGGGGGAACTCCAGGCAGGGCAAATTGGGCCCTGGATGTTGAGAAGCTGGGTAACAAGTACTGAGAGAAC
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE6
#NB501399:67:HFKTCBGX5:1:11101:1109:1044 1:N:0:CTTGTA
TAGGCNACCTGGTGGTCCCCCGCTCCCGGGAGGTCACCATATTGATGCCGAACTTAGTGCGGACACCCGATCGGCATAGCGCACTACAGCCCAGAACTCCTGGACTCAAGCGATCCTCCAGCCTCAGCCTCCCGAGTAGCTGGGACTACAG
+
And I only want to extract the 5 to 11 character which located in sequence part (for the first one is TNACGG, for the second is CNACCT) and makes it a new txt file. Can I do that?
You can use GNU sed with zcat:
zcat fastq.gz | sed -n '2~5{s/.\{4\}\(.\{6\}\).*/\1/;p}'
-n means lines are not printed by default
2~5 means start with line 2, match every fifth line
when the "address" matches, the substitution remembers the fifth to tenth character in \1 and replaces the whole line with it, p prints the result
Another using zgrep and positive lookbehind:
$ zgrep -oP "(?<=^[ACTGN]{4})[ACTGN]{6}" foo.gz
TNACGG
CNACCT
Explained:
zgrep : man zgrep: search possibly compressed files for a regular expression
-o Print only the matched (non-empty) parts of a matching line
-P Interpret the pattern as a Perl-compatible regular expression (PCRE).
(?<=^[ACTGN]{4}) positive lookbehind
[ACTGN]{6} match 6 named characters that are preceeded by above
foo.gz my test file
$ zcat fastq.gz | awk '(NR%5)==2{print substr($0,5,6)}'
TNACGG
CNACCT

How to use sed to replace a string that contains the slash?

I have a text file that contain a lot of mess text.
I used grep to get all the text that contains the string prod like this
cat textfile | grep "<host>prod*"
The result
<host>prod-reverse-proxy01</host>
<host>prod-reverse-proxy01</host>
<host>prod-reverse-proxy01</host>
Continually, i used sed with the intention to remove all the "host" part
cat textfile | grep "<host>prod*" | sed "s/<host>//g"; "s/</host>//g"
But only the first "host" was removed.
prod-reverse-proxy01</host>
prod-reverse-proxy01</host>
prod-reverse-proxy01</host>
How can i remove the other "/host" part?
sed -n -e "s/^<host>\(.*\)<\/host>/\1/p" textfile
sed can process your file directly. No need to grep or cat.
-n is there to suppress any lines that do not match. Last 'p' in the script will print all matching files.
Script dissection:
s/.../.../...
is the search/replace form. The bit between the first and the second '/' is what you search for. The bit between the second and third is what you replace it with. The last part is any commands you want to apply to the replacement.
Search:
^<host>\(.*\)<\/host>
finds all lines beginning with <host> followed by any text (.*) followed by </host>. Any text between <host> and </host> is stored into internal variable '1' using '(' and ')'. Note that (, ) and / (in </host>) have to be escaped.
Replace:
\1
Replace found text with contents of variable 1 (1 has to be escaped, otherwise, everything is replaced by character '1'.
Commands:
p
Print resulting line (after replacement).
Note: Your search involves removing two similar but not identical strings (<host> and </host>).
I think this sed is enough
sed 's/<[/]*host>//g' infile

Do not print unmatched text with sed

I want to print only matched lines and strip unmatched ones, but with following:
$ echo test12 test | sed -n 's/^.*12/**/p'
I always get:
** test
instead of:
**
What am I doing wrong?
[edit1]
I provide more information of what I need - and actually I should start with it. So, I have a command which produced lots of lines of output, I want to grab only parts of the lines - the ones that matches, and strip the result. So in the above example 12 was meant to find end of matched part of the line, and instead of ** I should have put & which represents matched string. So the full example is:
echo test12 test | sed -n 's/^.*12/&/p'
which produces exactly the same output as input:
test12 test
the expected output is:
test12
As suggested I started to find a grep alternative and the following looks promising:
$ echo test12 test | grep -Eo "^.*12"
but I dont see how to format the matched part, this only strips unmatched text.
EDIT: In some cases, the -E flag might be needed for sed. But then the brackets don't need to be escaped anymore. check your sed's man page.
I think what you are looking for is this:
echo test12 test | sed -n 's/^\(.*12\).*$/\1/p'
if you want to discard the rest of the line, you have to match it as well, but not include it in the output. the \( and \) denote a group that is then referenced by the \1.
Good luck :)
Additional information on sed:
sed works on lines, and the ampersand characters represents the entire line that was matched by the given regular expression. if a regex is "open" at the end (i.e. doesn't end with the endline character ($), it acts as if .*$ is appended to the match string. (not sure if that is how it is implemented, but could very well be.)
Try:
echo test12 test | sed -n 's/^.*/**/p'
You don't need to match the number 12, since that is already being done in your regex.
Your regular expression is matching anything from the beginning of the line until the expression '12'. All the matched expression is replaced with '**', that is why you get '** test'. If you want only match I recommend you using grep.

Extracting key word from a log line

I have a log which got like this :
.....client connection.....remote=/xxx.xxx.xxx.xxx]].......
I need to extract all lines in the log which contain the above,and print just the ip after remote=.. This would be something in the pattern :
grep "client connection" xxx.log | sed -e ....
Using grep:
grep -oP '(?<=remote=/)[^\]]+' file
o is to extract only the pattern, instead of entire line.
P is to match perl like regex. In this case, we are using "negative look behind". It will try to match set of characters which is not "]" which is preceeded by remote=/
grep -oP 'client connection.*remote=/\K.*?(?=])' input
Prints anything between remote=/ and closest ] on the lines which contain client connection.
Or by using sed back referencing: Here the line is divided into three parts/groups which are later referred by \1 \2 or \3. Each group is enclosed by ( and ). Here IP address belongs to 2nd group, so whole line is replaced by 2nd group which is IP address.
sed -r '/client connection/ s_(^.*remote=/)(.*?)]](.*)_\2_g' input
Or using awk :
awk -F'/|]]' '/client connection/{print $2}' input
Try this:
grep 'client connection' test.txt | awk -F'[/\\]]' '{print $2}'
Test case
test.txt
---------
abcd
.....client connection.....remote=/10.20.30.40]].......
abcs
.....client connection.....remote=/11.20.30.40]].......
.....client connection.....remote=/12.20.30.40]].......
Result
10.20.30.40
11.20.30.40
12.20.30.40
Explanation
grep will shortlist the results to only lines matching client connection. awk uses -F flag for delimiter to split text. We ask awk to use / and ] delimiters to split text. In order to use more than one delimiter, we place the delimiters in [ and ]. For example, to split text by = and :, we'd do [=:].
However, in our case, one of the delimiters is ] since my intent is to extract IP specifically from /x.x.x.x] by spitting the text with / and ]. So we escape it ]. The IP is the 2nd item from the splitting.
A more robust way, improved over this answer would be to also use GNU grep in PCRE mode with -P for perl style regEx match, but matching both the patterns as suggested in the question.
grep -oP "client connection.*remote=/\K(\d{1,3}\.){3}\d{1,3}" file
10.20.30.40
11.20.30.40
12.20.30.40
Here, client connection.*remote matches both the patterns in the lines and extracts IP from the file. The \K is a PCRE syntax to ignore strings up to that point and print only the capture group following it.
(\d{1,3}\.){3}\d{1,3}
To match the IP i.e. 3 groups of digits separated by dots of length from 1 to 3 followed by 4th octet.

Replace first six commas for each line in a text file

I want to replace the first six , for each line in a text file using sed or something similar in linux.
There are more than six , on each line, but only the first six should be replaced by |.
Sed doesn't really support the notion of "the first n occurrences", only "the n-th occurrence"; GNU sed has one for "replace all matches from the n-th on", which is not what you want in this case. To get the first six commas replaced, you have to call the s command six times:
sed 's/,/|/;s/,/|/;s/,/|/;s/,/|/;s/,/|/;s/,/|/' infile
If, however, you know that there are no | in the file and you have GNU sed, you can do this:
sed 's/,/|/g;s/|/,/7g' infile
This replaces all commas with pipes, then turns the pipes from the 7th on back to commas.
If you do have pipes beforehand, you can turn them into something that you know isn't in the string first:
sed 's/|/~~/g;s/,/|/g;s/|/,/7g;s/~~/|/g' infile
This makes all | into ~~ first, then all , into |, then the | from the 7th on back into ,, and finally the ~~ back into |.
Testing on this input file:
,,,,,,X,,,,,,
,,,|,,,|,,,|,,,|
the first and third command result in
||||||X,,,,,,
||||||||,,,|,,,|
The second one would fail on the second line because there are already pipe characters.
This might work for you (GNU sed):
sed 'y/,/\n/;s/\n/,/7g;y/\n/|/' file
Translate all ,'s to \n's, then replace from the seventh \n to the end of line by ,'s, then replace the remaining \n's by |'s.
Use the following pattern in sed: sed 's/old/new/<number>'
Where <number> is the number of times you want this pattern applied.
You can replace <number> with g to apply the pattern to all occurrences.
You can try this sed,
sed -r ':loop; s/^([^,]*),/\1|/g; /^([^|]*\|){6}/t; b loop' file
(OR)
sed ':loop; s/^\([^,]*\),/\1|/g; /^\([^|]*|\)\{6\}/t; b loop' file
Test:
$ cat file
a,b,c,d,e,f,g,h,i,j,k
$ sed -r ':loop; s/^([^,]*),/\1|/g; /^([^|]*\|){6}/t; b loop' file
a|b|c|d|e|f|g,h,i,j,k
Note: This will work only if you do not have any pipe(|) before that.

Resources