How to remove lines with duplicate pair of words?

I have a file with multiple columns, like this:
abc cvn bla..bla..n_columns
xnt yuk m_columns
abc cvn xxxx
vbh ast
sth rty
xnt yuk
I want to create a new file that keeps only one line for each pair of words in the first two columns.
The final file will look like:
abc cvn bla..bla..n_columns
xnt yuk m_columns
vbh ast
sth rty

All you need is:
awk '!seen[$1,$2]++' file
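As a rough sketch of why this one-liner works, the snippet below runs it on the sample data from the question. The wrapper name dedup_pairs is made up for illustration; the awk program itself is the one given above.

```shell
# dedup_pairs is a hypothetical wrapper; it reads standard input.
# seen[$1,$2] counts how often this pair of first two fields has appeared.
# The post-increment yields the old count, so !seen[...]++ is true (and the
# line is printed) only the first time a pair shows up.
dedup_pairs() {
  awk '!seen[$1,$2]++'
}

printf '%s\n' \
  'abc cvn bla..bla..n_columns' \
  'xnt yuk m_columns' \
  'abc cvn xxxx' \
  'vbh ast' \
  'sth rty' \
  'xnt yuk' | dedup_pairs
```

The first line seen for each pair is kept, and the input order is preserved.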

Even if abc cvn xxxx appears before abc cvn bla..bla..n_columns, I just want
to keep one of the lines. It does not matter to me which of them is kept.
Either line will be okay.
If the output order doesn't matter, you can use sort:
sort -u -k1,2 file
Otherwise, you should use awk as suggested by devnull.
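A small sketch of the sort variant on the sample data may make the trade-off visible. The wrapper name sort_unique_pairs and the LC_ALL=C setting are assumptions added here for a predictable ordering; they are not part of the original answer.

```shell
# sort_unique_pairs is a hypothetical wrapper; it reads standard input.
# -k1,2 makes the sort key run from the first field through the second,
# -u keeps a single line per distinct key. LC_ALL=C is an assumption made
# here only to get a predictable byte-wise ordering.
sort_unique_pairs() {
  LC_ALL=C sort -u -k1,2
}

printf '%s\n' \
  'abc cvn bla..bla..n_columns' \
  'xnt yuk m_columns' \
  'abc cvn xxxx' \
  'vbh ast' \
  'sth rty' \
  'xnt yuk' | sort_unique_pairs
```

The result comes out sorted by key rather than in input order, and which of the duplicate lines survives is left to sort.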

sed -n 'H
$ {x
s/$/\
/
: again
s/\(\n\)\([^ ]\{1,\} \{1,\}[^ [:cntrl:]]\{1,\}\)\(.*\)\1\2[^[:cntrl:]]*\n/\1\2\3\1/
t again
s/\n\(.*\)\n/\1/
p
}' YourFile
This works on any repeated pair of values (a pair being two runs of non-space characters separated by spaces) across the whole text, looping for as long as a duplicate pair is found and replaced.
Principle:
H appends each line from the working buffer to the hold buffer (sed works line by line in the working buffer; there is a working buffer and a hold buffer).
$ restricts the block that follows to the last line.
x swaps the working and hold buffers, so the whole file is now in the working buffer, starting with a newline (because of the append action).
s/... adds a newline at the end (used as a delimiter by the later substitution).
: again sets a label (for a later goto).
s/...// is the core of the process. It searches for a pair of words at the start of a line and a later line starting with the same pair. If one is found, it replaces the whole block with the part from the start of the block up to, but not including, the second pair. (The block runs from the first pair to the newline ending the line that holds the second pair.)
t again jumps back to the label again if the previous substitution succeeded.
s/.../ removes the newlines added at the start and end.
p prints the result.
Sed always matches as much of a pattern as it can, so if one pair occurs more than twice, it first removes the last occurrence and works backwards until only one is left.

Related

Using sed to delete specific lines after LAST occurrence of pattern

I have a file that looks like:
this: name
this: age
Remove these lines and space above.
Remove here too and space below
Keep everything below here.
I don't want to hardcode 2, as the number of lines containing "this" can change. How can I delete the 4 lines after the last occurrence of the string? I am trying sed -e '/this: /{n;N;N;N;N;d}', but it deletes after the first occurrence of the string.

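Could you please try the following:
awk '
FNR==NR{                    # first pass over the file
  if($0~/this/){
    line=FNR                # remember the number of the last matching line
  }
  next
}
FNR<=line || FNR>(line+4)   # second pass: skip the 4 lines after that line
' Input_file Input_file
With the samples shown, the output will be as follows.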
this: name
this: age
Keep everything below here.
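The two-pass idea can be sketched with the pattern and the line count passed in as parameters. The helper name delete_after_last, the extra guard for "no match at all", and the blank lines in the sample input are assumptions made for this illustration.

```shell
# delete_after_last PATTERN COUNT FILE  (hypothetical helper name)
# Pass 1 remembers the line number of the last match; pass 2 prints every
# line except the COUNT lines that follow it. FILE is read twice, so it
# must be a regular file rather than a pipe.
delete_after_last() {
  awk -v pat="$1" -v n="$2" '
    FNR == NR { if ($0 ~ pat) last = FNR; next }
    !last || FNR <= last || FNR > last + n   # !last: no match, keep every line
  ' "$3" "$3"
}

# Sample input; the two blank lines are assumed from the wording of the question.
tmp=$(mktemp)
printf '%s\n' 'this: name' 'this: age' '' \
  'Remove these lines and space above.' \
  'Remove here too and space below' '' \
  'Keep everything below here.' > "$tmp"
delete_after_last '^this:' 4 "$tmp"
rm -f "$tmp"
```

Passing the count as a variable avoids editing the awk program when the number of lines to drop changes.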

You can also use this minor change to make your original sed command work.
sed '/^this:/ { :k ; n ; // b k ; N ; N ; N ; d }' input_file
It uses a loop which prints the current line and reads the next one (n) while it keeps matching the regex (the empty regex // recalls the latest one evaluated, i.e. /^this:/, and the command b k goes back to the label k on a match). Then you can append the next 3 lines and delete the whole pattern space as you did.
Another, more concise possibility using GNU sed could be this.
sed '/^this:/ b ; /^/,$ { //,+3 d }' input_file
This one prints any line beginning with this: (b without label goes directly to the next line cycle after the default print action).
On the first line not matching this:, two nested ranges are triggered. The outer range is "one-shot": it is triggered right away by /^/, which matches any line, and then stays triggered up to the last line ($). The inner range is a "toggle" range. It is also triggered right away, because // recalls /^/ on this line (and only on this line, hence the one-shot outer range), and then stays triggered for 3 additional lines (the end address +3 is a GNU extension). After that, /^/ is no longer evaluated, so the inner range cannot trigger again: // now recalls /^this:/, and lines matching it have already been short-circuited by the b command.

This might work for you (GNU sed):
sed -E ':a;/this/n;//ba;$!N;$!ba;s/^([^\n]*\n?){4}//;/./!d' file
If the pattern space (PS) contains this, print the PS and fetch the next line.
If the following line contains this repeat.
If the current line is not the last line, append the next line and repeat.
Otherwise, remove the first four lines of the PS and print the remainder.
Unless the PS is empty in which case delete the PS entirely.
N.B. This only reads the file once. Also, the OP asks:
"How can I delete 4 lines after the last occurrence of the string"
However, the example would seem to expect 5 lines to be deleted.

Output only the first pattern-line and its following line

I need to filter the output of a command.
I tried this.
bpeek | grep nPDE
My problem is that I need all matches of nPDE and the line after each found line. So the output would look like:
iteration nPDE
1 1
iteration nPDE
2 4
The best case would be if it showed me the found line only once and then only the line after it.
I found solutions with awk, but as far as I know awk can only read files.

There is an option for that.
grep --help
...
-A, --after-context=NUM print NUM lines of trailing context
Therefore:
bpeek | grep -A 1 'nPDE'

With awk (for completeness since you have grep and sed solutions):
awk '/nPDE/{c=2} c&&c--'
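Since awk reads standard input when no file is given, it can sit at the end of a pipe just like grep. The sketch below illustrates that, plus a variant for the "show the found line only once" wish from the question. The names fake_bpeek, match_with_context, and header_once are invented here, and fake_bpeek only imitates the real command's output.

```shell
# fake_bpeek stands in for the real command so the sketch runs anywhere;
# its output format is invented for illustration.
fake_bpeek() {
  printf '%s\n' 'starting' 'iteration nPDE' '1 1' 'working' 'iteration nPDE' '2 4'
}

# Print each matching line plus the n lines after it (the counter idiom
# from the answer, with the count passed in as a variable).
match_with_context() {
  awk -v n="$1" '/nPDE/ { c = n + 1 } c && c--'
}

# Print the matching line only the first time, then just the line that
# follows each match.
header_once() {
  awk '/nPDE/ { if (!shown++) print; want = 1; next }
       want   { print; want = 0 }'
}

fake_bpeek | match_with_context 1
fake_bpeek | header_once
```

With the real command, the pipe would presumably be bpeek | awk '...' in the same way.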

grep -A works if your grep supports it (it's not in POSIX grep). If it doesn't, you can use sed:
bpeek | sed '/nPDE/!d;N'
which does the following:
/nPDE/!d # If the line doesn't match "nPDE", delete it (starts new cycle)
N # Else, append next line and print them both
Notice that this would fail to print the right output for this file:
nPDE
nPDE
context line
If you have GNU sed, you can use an address range as follows:
sed '/nPDE/,+1!d'
Addresses of the format addr1,+N define the range between addr1 (in our case /nPDE/) and the following N lines. This solution is easier to adapt to a different number of context lines, but still fails with the example above.
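To see the failure concretely, here is a small sketch that feeds the three-line example to the portable sed command and, for comparison, to a counter-based awk program. The function name tricky_input is made up for this illustration.

```shell
# Two adjacent matches followed by one context line, as in the example above.
tricky_input() {
  printf '%s\n' 'nPDE' 'nPDE' 'context line'
}

# N swallows the second match before it can be tested, so the context
# line is later deleted like any other non-matching line.
tricky_input | sed '/nPDE/!d;N'

# Resetting a counter on every match does not have that problem.
tricky_input | awk '/nPDE/ { c = 2 } c && c--'
```

The sed command prints only the two nPDE lines, while the awk program also prints the context line.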
A solution that manages cases like
blah
nPDE
context
blah
blah
nPDE
nPDE
context
nPDE
would look like
sed -n '/nPDE/{$p;:a;N;/\n[^\n]*nPDE[^\n]*$/!{p;b};ba}'
doing the following:
/nPDE/ { # If the line matches "nPDE"
$p # If we're on the last line, just print it
:a # Label to jump to
N # Append next line to pattern space
/\n[^\n]*nPDE[^\n]*$/! { # If appended line does not contain "nPDE"
p # Print pattern space
b # Branch to end (start new loop)
}
ba # Branch to label (appended line contained "nPDE")
}
All other lines are not printed because of the -n option.
As pointed out in Ed's comment, this is neither readable nor easily extended to a larger number of context lines, but it works correctly for one context line.

insert consecutive number after .fa header id

I have a large .fa file that consists of multiple merged fasta files. Each file is separated by a header line that begins with ">".
Here is an example:
>DPB1*04:01:01:01 [most similar sequence] for DPB1 in 3507009462
I would like to modify each header inside the file by adding a consecutive integer after each id. The id is the first sequence of characters after ">" and before the first space.
The modified header would look like this:
>DPB1*04:01:01:011 [most similar sequence] for DPB1 in 3507009462
I found some code that replaces the header with a consecutive number, but I am not sure how to insert it after the header id.
cat youFile.fa | perl -ane 'if(/\>/){$a++;print ">$a\n"}else{print;}' > youFile_new.fa
Thanks for your help

$ perl -wpe 's/\s/++$i . " "/e if /^>/' input.fa
Explanation:
Substitute first occurrence of whitespace with counter variable and single space...
...but only if the line starts with >
Print every line (due to -p switch)
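For comparison, the same idea can be sketched in awk. This is an alternative, not the answer's method; the function name number_fasta_ids and the second record in the sample are invented for illustration.

```shell
# number_fasta_ids is a hypothetical name; it reads standard input.
# On header lines (starting with ">") the first space is replaced by the
# running counter followed by a space, which puts the number right after
# the id. Headers without any space are left unchanged by this sketch.
number_fasta_ids() {
  awk '/^>/ { n++; sub(/ /, n " ") } { print }'
}

printf '%s\n' \
  '>DPB1*04:01:01:01 [most similar sequence] for DPB1 in 3507009462' \
  'ACGT' \
  '>A*01:01 second record' \
  'TTGA' | number_fasta_ids
```

Sequence lines pass through untouched because the substitution only runs on lines that start with ">".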

Linux - Remove line feed

Is there a way to use a Linux command to remove the LFs displayed below?
Each row should begin with the string 'F|'. Unfortunately, multiple rows in my Oracle db are stored with hex 0a (LF), which causes line breaks at spool.
Thanks
$ grep -nvB 1 '^F|' File.txt
4720156-F|29|204380|A|16060|Telephone Updated by DCA|99996319 ,
4720157: |manual|
--
6005453-F|29|121389|A|16060|Telephone Updated by DCA|96844599 ,
6005454: |new|
--
6354243-F|29|366910|A|16060|Telephone Updated by DCA|
6354244: |new|
--
13318314-F|29|397713|A|16060|Telephone Updated by DCA|97597079 ,
13318315: ,52094436|new|
--
13471591-F|29|17945|A|16060|Telephone Updated by DCA|47990248,94291610,
13471592: |new|
--
13471607-F|29|152501|A|16060|Telephone Updated by DCA|
13471608: ,90290027,38297606|new|
--
13944867-F|29|322564|A|16060|Telephone Updated by DCA|
13944868: |new|
User#db01.test processed$

So, you want the lines which do not begin with F| to be joined to the line before (which does). A solution with sed:
sed -n '/^F|/{x;2,$p;be};x;G;s/\n//;h;:e;${g;p}' File.txt
/^F|/ If the line begins with F|:
    x       Exchange the contents of the hold and pattern spaces
    2,$p    If not the first line: print the (previously held) line
    be      Branch to label e
Otherwise (the line doesn't begin with F|):
    x       Exchange the contents of the hold and pattern spaces
    G       Append hold space to pattern space (lines joined, but still LF embedded)
    s/\n//  Remove the LF
    h       Copy pattern space (joined line) to hold space
:e Label e (both cases above get here):
    $       If the last line:
      g     Copy hold space to pattern space
      p     Print the (last) line
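The same join-to-previous-line logic can also be sketched in awk, which some readers may find easier to follow. This is an alternative to the sed script, not a translation of it; the function name join_continuations and the shortened sample rows are invented for illustration.

```shell
# join_continuations is a hypothetical name; it reads standard input.
join_continuations() {
  awk '
    NR > 1 && /^F\|/ { print buf; buf = "" }  # a new record starts: flush the previous one
    { buf = buf $0 }                           # append the current line to the record
    END { if (NR) print buf }                  # flush the final record
  '
}

printf '%s\n' \
  'F|1|Telephone Updated|99996319 ,' \
  '|manual|' \
  'F|2|Telephone Updated|' \
  ',52094436|new|' \
  'F|3|ok|' | join_continuations
```

Every line that does not start with F| is glued onto the record before it, so each output row begins with F| again.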

How can I swap two lines using sed?

Does anyone know how to replace line a with line b and line b with line a in a text file using the sed editor?
I can see how to replace a line in the pattern space with a line that is in the hold space (i.e., /^Paco/x or /^Paco/g), but what if I want to take the line starting with Paco and replace it with the line starting with Vinh, and also take the line starting with Vinh and replace it with the line starting with Paco?
Let's assume for starters that there is one line with Paco and one line with Vinh, and that the line Paco occurs before the line Vinh. Then we can move to the general case.

#!/bin/sed -f
/^Paco/ {
:notdone
N
s/^\(Paco[^\n]*\)\(\n\([^\n]*\n\)*\)\(Vinh[^\n]*\)$/\4\2\1/
t
bnotdone
}
After matching /^Paco/ we read into the pattern buffer until s// succeeds (or EOF: the pattern buffer will be printed unchanged). Then we start over searching for /^Paco/.

cat input | tr '\n' 'ç' | sed 's/\(ç__firstline__\)\(ç__secondline__\)/\2\1/g' | tr 'ç' '\n' > output
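Replace __firstline__ and __secondline__ with your desired regexps. Be sure to substitute any instances of . in your regexp with [^ç]. If your text actually has ç in it, substitute with something else that your text doesn't have. Note that some tr implementations (GNU tr, for example) work on bytes, so in a UTF-8 locale a multibyte character such as ç may not survive the round trip; a single-byte character that does not occur in the text is a safer placeholder.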

Try this awk script:
s1="$1"
s2="$2"
awk -vs1="$s1" -vs2="$s2" '
{ a[++d]=$0 }
$0~s1{ h=$0;ind=d}
$0~s2{
a[ind]=$0
for(i=1;i<d;i++ ){ print a[i]}
print h
delete a;d=0;
}
END{ for(i=1;i<=d;i++ ){ print a[i] } }' file
Output:
$ cat file
1
2
3
4
5
$ bash test.sh 2 3
1
3
2
4
5
$ bash test.sh 1 4
4
2
3
1
5
Use sed only for simple substitutions (or not at all). For anything more complicated, use a programming language.

A simple example from the GNU sed texinfo doc:
Note that on implementations other than GNU `sed' this script might
easily overflow internal buffers.
#!/usr/bin/sed -nf
# reverse all lines of input, i.e. first line became last, ...
# from the second line, the buffer (which contains all previous lines)
# is *appended* to current line, so, the order will be reversed
1! G
# on the last line we're done -- print everything
$ p
# store everything on the buffer again
h
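As a quick check of the script above, it can be written as a one-liner and run on three lines of input. The wrapper name reverse_lines is made up for this sketch.

```shell
# The script above as a one-liner: 1!G appends the hold space (all earlier
# lines) to every line but the first, h saves the result back, and $p
# prints the accumulated text once the last line has been read.
reverse_lines() {
  sed -n '1!G;$p;h'
}

printf '%s\n' 'first' 'second' 'third' | reverse_lines
```

This prints third, second, first, each on its own line.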
