I have a text file that contains numerous lines that have partially duplicated strings. I would like to remove lines where a string match occurs twice, such that I am left only with lines with a single match (or no match at all).
An example of the input:
g1: sample1_out|g2039.t1.faa sample1_out|g334.t1.faa sample1_out|g5678.t1.faa sample2_out|g361.t1.faa sample3_out|g1380.t1.faa sample4_out|g597.t1.faa
g2: sample1_out|g2134.t1.faa sample2_out|g1940.t1.faa sample2_out|g45.t1.faa sample4_out|g1246.t1.faa sample3_out|g2594.t1.faa
g3: sample1_out|g2198.t1.faa sample5_out|g1035.t1.faa sample3_out|g1504.t1.faa sample5_out|g441.t1.faa
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
In this case I would like to remove lines 1, 2, and 3 of the block: sample1 is repeated multiple times on line 1, sample2 appears twice on line 2, and sample5 appears twice on line 3. Line 4 would pass because it contains only one instance of each sample.
I am okay repeating this operation multiple times using different 'match' strings (e.g. sample1_out , sample2_out etc in the example above).
Here is one in GNU awk:
$ awk -F"[| ]" '{ # pipe or space is the field reparator
delete a # delete previous hash
for(i=2;i<=NF;i+=2) # iterate every other field, ie right side of space
if($i in a) # if it has been seen already
next # skit this record
else # well, else
a[$i] # hash this entry
print # output if you make it this far
}' file
Output:
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
The following sed command will accomplish what you want.
sed -ne '/.* \(.*\)|.*\1.*/!p' file.txt
With grep (backreferences with -E are a GNU extension):
grep -vE '(sample[0-9]).*\1' file
Building on Glenn's answer: use -i with sed to make the changes directly in the file.
sed -r '/(sample[0-9]).*\1/d' txt_file
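The question also mentions being okay with repeating the operation once per match string. A plain shell loop over the grep solution is one way to do that (a sketch assuming GNU grep and the sample names shown above; it rewrites the file in place through a temporary file):

for s in sample1_out sample2_out sample3_out sample4_out sample5_out; do
    grep -vE "($s).*\1" file > file.tmp && mv file.tmp file
done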
I have a file like this
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq"rst"|uv||xyz
abc|def||ghi|jklm|"nopq"r"st"|uv||xyz
The 6th column may be double-quoted. I want to replace all occurrences of double quotes in this field with a backslash-double quote (\").
I wish my output to look like
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
I have tried combinations of the below, but come up short each time:
sed -i 's/\"/\\\"/2' file.txt (this replaces only the 2nd occurrence)
sed -i 's/\"/\\\"/2g' file.txt (this replaces the 2nd occurrence and every one after it)
My file will have millions of rows, so I probably need a sed or awk command only.
Please help.
You may use this awk solution in any version of awk:
awk 'BEGIN {FS=OFS="|"} {
c1 = substr($6, 1, 1)            # first character of the 6th field
c2 = substr($6, length($6), 1)   # last character of the 6th field
s = substr($6, 2, length($6)-2)  # everything in between
gsub(/"/, "\\\"", s)             # escape the inner double quotes
$6 = c1 s c2                     # reassemble the field
} 1' file
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
If this isn't all you need then edit your question to provide more truly representative sample input/output including cases that this doesn't work for:
$ sed 's/"/\\"/g; s/|\\"/|"/g; s/\\"|/"|/g' file
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
The above will work in any sed.
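Broken out, the three substitutions do the following (annotations added; the quoted field in this sample is always delimited by | on both sides, so the boundary quotes are exactly the ones next to a |):

s/"/\\"/g     # escape every double quote on the line
s/|\\"/|"/g   # undo the escape for a quote that opens a field (right after a |)
s/\\"|/"|/g   # undo the escape for a quote that closes a field (right before a |)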
This might work for you (GNU sed):
sed -E 's/[^|]*/\n&\n/6 # isolate the 6th field
h # make a copy
s/"/\\"/g # replace " by \"
s/\\(")\n|\n\\(")/\1\n\2/g # repair start and end "s
H # append amended line to copy
g # get copies to current line
s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file # swap fields
Surround the 6th field by newlines and make a copy in the hold space.
Replace all "'s by \"'s and remove the \'s at the start and end of the field if the field begins and ends in "'s
Append the amended line to the copy and replace the current line by the doubled line.
Using pattern matching replace copied line 6th field by the amended one.
I need to filter the output of a command.
I tried this.
bpeek | grep nPDE
My problem is that I need all matches of nPDE and the line after each matched line. So the output would look like:
iteration nPDE
1 1
iteration nPDE
2 4
The best case would be if it would show me the found line only once and then only the line after it.
I found solutions with awk, but as far as I know awk can only read files.
There is an option for that.
grep --help
...
-A, --after-context=NUM print NUM lines of trailing context
Therefore:
bpeek | grep -A 1 'nPDE'
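If the "--" group separators that GNU grep prints between non-adjacent matches get in the way, GNU grep can also suppress them:

bpeek | grep -A 1 --no-group-separator 'nPDE'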
With awk (for completeness since you have grep and sed solutions):
awk '/nPDE/{c=2} c&&c--'
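Despite the concern in the question, awk reads standard input when no file argument is given, so it can be used in a pipeline just like grep:

bpeek | awk '/nPDE/{c=2} c&&c--'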
grep -A works if your grep supports it (it's not in POSIX grep). If it doesn't, you can use sed:
bpeek | sed '/nPDE/!d;N'
which does the following:
/nPDE/!d # If the line doesn't match "nPDE", delete it (starts new cycle)
N # Else, append next line and print them both
Notice that this would fail to print the right output for this file
nPDE
nPDE
context line
If you have GNU sed, you can use an address range as follows:
sed '/nPDE/,+1!d'
Addresses of the format addr1,+N define the range between addr1 (in our case /nPDE/) and the following N lines. This solution is easier to adapt to a different number of context lines, but still fails with the example above.
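For example, to keep three lines of context after each match instead of one:

sed '/nPDE/,+3!d'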
A solution that manages cases like
blah
nPDE
context
blah
blah
nPDE
nPDE
context
nPDE
would look like
sed -n '/nPDE/{$p;:a;N;/\n[^\n]*nPDE[^\n]*$/!{p;b};ba}'
doing the following:
/nPDE/ { # If the line matches "nPDE"
$p # If we're on the last line, just print it
:a # Label to jump to
N # Append next line to pattern space
/\n[^\n]*nPDE[^\n]*$/! { # If appended line does not contain "nPDE"
p # Print pattern space
b # Branch to end (start new loop)
}
ba # Branch to label (appended line contained "nPDE")
}
All other lines are not printed because of the -n option.
As pointed out in Ed's comment, this is neither readable nor easily extended to a larger number of context lines, but it works correctly for one context line.
I have a rather large file. What is common to all of it is the hostname line that breaks up each section, for example:
HOSTNAME:host1
data 1
data here
data 2
text here
section 1
text here
part 4
data here
comm = 2
HOSTNAME:host-2
data 1
data here
data 2
text here
section 1
text here
part 4
data here
comm = 1
As you see above, in between each hostname line there are other sections broken down by keywords or lines that have specific values.
I would like to use a one-liner to print the hostname for each section and then print whichever lines I want to extract under each hostname section.
Can you please help? I am currently using grep -C 10 HOSTNAME | grep pattern,
but this assumes that there are 10 lines in each section, which is not an optimal way to do this; can someone show a better way? I also need to be able to print more than one line under each pattern that I find. So if I find data 1 and there are additional lines under it, I would like to grab and print them too.
So the output of a command like:
grep -C 10 HOSTNAME | grep 'data 1'
grep -C 10 HOSTNAME | grep -A 2 'data 1'
would be:
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
Besides grep, I use this sed command to print my output:
sed -r '/HOSTNAME|shared/!d' filename
The only problem with this sed command is that it only prints the lines that match the patterns shared and HOSTNAME. I also need to specify the number of lines to print under the line that matched the pattern shared. So I would like to print HOSTNAME and give the number of lines to print under the second search pattern, shared.
Thanks
awk to the rescue!
$ awk -v lines=2 '/HOSTNAME/{c=lines} NF&&c&&c--' file
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
This prints lines number of lines, including the line with the pattern match; empty lines are neither printed nor counted.
If you want to specify a secondary keyword instead of a number of lines:
$ awk -v key='data 1' '/HOSTNAME/{h=1; print} h&&$0~key{print; h=0}' file
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
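If you need both, a secondary keyword and a count of lines to print after it, the two ideas combine naturally. A sketch, where key and lines are parameters you set yourself and the count includes the matching line:

$ awk -v key='data 1' -v lines=2 '/HOSTNAME/{print; next} $0~key{c=lines} c&&c--' file
HOSTNAME:host1
data 1
data here
HOSTNAME:host-2
data 1
data here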
Here is a sed twoliner:
sed -n -r '/HOSTNAME/ { p }
/^\s+data 1/ {p }' hostnames.txt
It prints (p)
when the line contains a HOSTNAME
when the line starts with some whitespace (\s+) followed by your search criterion (data 1)
non-matching lines are not printed (due to the sed -n option)
Edit: Some remarks:
this was tested with GNU sed 4.2.2 under Linux
you don't need the -r; if your sed version does not support it, replace the second pattern with /^.*data 1/
we can squash everything into one line with ;
Putting it all together, here is a revised version in one line, without the need for the extended regex ( i.e without -r):
sed -n '/HOSTNAME/ { p } ; /^.*data 1/ {p }' hostnames.txt
The OP's requirements seem to be very unclear, but the following is consistent with one interpretation of what has been requested; more importantly, the program has no special requirements, and the code can easily be modified to meet a variety of requirements. In particular, both search patterns (the HOSTNAME pattern and the "data 1" pattern) can easily be parameterized.
The main idea is to print all lines in a specified subsection, or at least a certain number up to some limit.
If there is a limit on how many lines in a subsection should be printed, specify a value for limit, otherwise set it to 0.
awk -v limit=0 '
/^HOSTNAME:/ { subheader=0; hostname=1; print; next}
/^ *data 1/ { subheader=1; print; next }
/^ *data / { subheader=0; next }
subheader && (limit==0 || (subheader++ < limit)) { print }'
Given the lines provided in the question, the output would be:
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
(Yes, I know the variable 'hostname' in the awk program is currently unused, but I included it to make it easy to add a test to satisfy certain obvious requirements regarding the preconditions for identifying a subheader.)
The simplest way to do it is to combine two sed commands, each using a GNU sed address range to print the matching line plus the line after it:
sed -n -e '/HOSTNAME/,+1p' -e '/Duplex/,+1p' file
Been stuck on this for a while; I managed to remove two columns completely, but now I need to remove two more pieces (3 in total) that sit within a single column heading. I've attached a snippet from my csv file.
timestamp;CPU;%usr;%nice;%sys;%iowait;%steal;%irq;%soft;%guest;%idle
2014-09-17 10-20-39 UTC;-1;6.53;0.00;4.02;0.00;0.00;0.00;0.00;0.00;89.45
2014-09-17 10-20-41 UTC;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
2014-09-17 10-20-43 UTC;-1;1.98;0.00;1.98;5.45;0.00;0.50;0.00;0.00;90.10
2014-09-17 10-20-45 UTC;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
2014-09-17 10-20-47 UTC;-1;0.50;0.00;1.50;0.00;0.00;0.00;0.00;0.00;98.00
2014-09-17 10-20-49 UTC;-1;0.50;0.00;1.01;3.02;0.00;0.00;0.00;0.00;95.48
What I'm wanting to do is remove yyyy-mm-dd and also UTC, leaving just 10-20-39 underneath the timestamp column heading. I've tried removing them but I can't seem to do it without taking out the headings.
Thanks to anyone who can help me with this
A perl way:
perl -pe 's/^.+? (.+?) .+?;/$1;/ if $.>1' file
Explanation
The -pe means "print each line after applying the script to it". The script itself identifies the first three space-separated words and replaces them with the 2nd of the three ($1, since that pattern was captured). This is only done if the current line number ($.) is greater than 1.
An awk way
awk -F';' '(NR>1){sub(/[^ ]* /,"",$1); sub(/ [^ ]*$/,"",$1)}1;' OFS=";" file
Here, we set the input field delimiter to ; and use sub() to remove the 1st and last word from the 1st field.
The following sed command works for you:
sed '1!s/^[^ ]\+ //;1!s/ UTC//'
Explanations:
1! Do not apply to the first line.
s/^[^ ]\+ // Remove the first group of non-space characters at line beginning ("2014-09-17 " in your case).
s/ UTC// Remove the string " UTC".
Assuming the csv file is stored as a.csv, then
sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv
prints the results to standard output, and
sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv > b.csv
saves the result to b.csv.
Edit: added sample results:
[pengyu#GLaDOS tmp]$ sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv
timestamp;CPU;%usr;%nice;%sys;%iowait;%steal;%irq;%soft;%guest;%idle
10-20-39;-1;6.53;0.00;4.02;0.00;0.00;0.00;0.00;0.00;89.45
10-20-41;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
10-20-43;-1;1.98;0.00;1.98;5.45;0.00;0.50;0.00;0.00;90.10
10-20-45;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
10-20-47;-1;0.50;0.00;1.50;0.00;0.00;0.00;0.00;0.00;98.00
10-20-49;-1;0.50;0.00;1.01;3.02;0.00;0.00;0.00;0.00;95.48
I need to remove lines with a duplicate value. For example, I need to remove lines 2 and 4 in the block below because they contain "Value04" - I cannot remove all lines containing Value04, because there are lines with that value that are NOT duplicates and must be kept. I can use any editor: Excel, vim, or any other Linux command-line tool.
In the end there should be no duplicate "UserX" values. User1 should only appear 1 time. But if User1 exists twice, I need to remove the entire line containing "Value04" and keep the one with "Value03"
Value01,Value03,User1
Value02,Value04,User1
Value01,Value03,User2
Value02,Value04,User2
Value01,Value03,User3
Value01,Value03,User4
Your ideas and thoughts are greatly appreciated.
The following Awk command removes all but the first occurrence of a value in the third column:
$ awk -F',' '{
if (!seen[$3]) {
seen[$3] = 1
print
}
}' textfile.txt
Output:
Value01,Value03,User1
Value01,Value03,User2
Value01,Value03,User3
Value01,Value03,User4
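The same logic is often condensed to a one-liner with the post-increment idiom:

awk -F',' '!seen[$3]++' textfile.txt

Like the longer version, this keeps whichever line comes first for each user, so it relies on the Value03 line preceding the Value04 line, as in the sample.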
same thing in Perl:
perl -F, -nae 'print unless $c{$F[2]}++;' textfile.txt
this uses autosplit mode: -a with -F, splits each line on commas and places the result into the @F array