I need to remove lines with a duplicate value. For example, I need to remove lines 2 and 4 in the block below because they contain "Value04" for users that already have a "Value03" line. I cannot simply remove every line containing Value04, because the real data has Value04 lines that are NOT duplicates and must be kept. I can use any editor: Excel, Vim, or any other Linux command-line tool.
In the end there should be no duplicate "UserX" values; User1 should appear only once. If User1 exists twice, I need to remove the entire line containing "Value04" and keep the one with "Value03".
Value01,Value03,User1
Value02,Value04,User1
Value01,Value03,User2
Value02,Value04,User2
Value01,Value03,User3
Value01,Value03,User4
Your ideas and thoughts are greatly appreciated.
Edit: reworded for clarity.
The following Awk command removes all but the first occurrence of a value in the third column:
$ awk -F',' '{
    if (!seen[$3]) {
        seen[$3] = 1
        print
    }
}' textfile.txt
Output:
Value01,Value03,User1
Value01,Value03,User2
Value01,Value03,User3
Value01,Value03,User4
The same thing in Perl:
perl -F, -nae 'print unless $c{$F[2]}++;' textfile.txt
This uses autosplit mode: -a with -F, splits each input line on commas and places the fields in the @F array.
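Both of the above keep the first line seen for each user, so they match the requirement only when the Value03 line always precedes the Value04 line. They can also be condensed into the usual one-liner dedup idiom, and if the order is not guaranteed, sorting so Value03 sorts before Value04 within each user fixes that (a sketch, at the cost of reordering the output):

awk -F, '!seen[$3]++' textfile.txt                          # same logic as the awk/perl above
sort -t, -k3,3 -k2,2 textfile.txt | awk -F, '!seen[$3]++'   # order-independent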
I have a file like this
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq"rst"|uv||xyz
abc|def||ghi|jklm|"nopq"r"st"|uv||xyz
The 6th column may be double-quoted. I want to replace all occurrences of double quotes in this field with a backslash-double-quote (\").
I wish my output to look like
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
I have tried combinations of the following, but come up short each time:
sed -i 's/\"/\\\"/2' file.txt (this replaces only the 2nd occurrence)
sed -i 's/\"/\\\"/2g' file.txt (this replaces the 2nd occurrence and every one after it)
My file has millions of rows, so I need a sed or awk solution.
Please help.
You may use this awk solution in any version of awk:
awk 'BEGIN {FS=OFS="|"} {
    c1 = substr($6, 1, 1)            # first character of the 6th field
    c2 = substr($6, length($6), 1)   # last character of the 6th field
    s = substr($6, 2, length($6)-2)  # everything in between
    gsub(/"/, "\\\"", s)             # escape only the inner quotes
    $6 = c1 s c2                     # reassemble the field
} 1' file
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
Alternatively, this simple sed handles the posted sample input:
$ sed 's/"/\\"/g; s/|\\"/|"/g; s/\\"|/"|/g' file
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
The above will work in any sed. If this isn't all you need, then edit your question to provide more truly representative sample input/output, including cases that these commands don't work for.
This might work for you (GNU sed):
sed -E 's/[^|]*/\n&\n/6 # isolate the 6th field
h # make a copy
s/"/\\"/g # replace " by \"
s/\\(")\n|\n\\(")/\1\n\2/g # repair start and end "s
H # append amended line to copy
g # get copies to current line
s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file # swap fields
Surround the 6th field with newlines and make a copy in the hold space.
Replace every " with \", then remove the backslashes at the start and end of the field if the field begins and ends with a ".
Append the amended line to the copy and replace the current line with the doubled line.
Using pattern matching, replace the copied line's 6th field with the amended one.
I have a CSV to which I need to add a new column at the end, filling the new column with a fixed string on every row.
Example csv:
os,num1,alpha1
Unix,10,A
Linux,30,B
Solaris,40,C
Fedora,20,D
Ubuntu,50,E
I tried the following awk command and did not get the expected result; I am not sure whether my indexing or column counting is right.
awk -F'[[:null:]]' '$2 && !$1{ $4="NA" }1'
Expected result is:
os,num1,alpha1,code
Unix,10,A,NA
Linux,30,B,NA
Solaris,40,C,NA
Fedora,20,D,NA
Ubuntu,50,E,NA
You can use sed:
sed 's/$/,NA/' db1.csv > db2.csv
then edit the first line containing the column titles.
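If you'd rather not edit the header by hand, the same idea works with two addressed substitutions; a sketch, using the same file names as above:

sed '1 s/$/,code/; 1! s/$/,NA/' db1.csv > db2.csv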
I'm not quite sure how you came up with that awk statement of yours, why you'd think that your file has NUL-terminated lines, or that [[:null:]] has become a valid character class ...
The following, however, will do your bidding:
awk 'NR==1{print $0",code"}; NR>1{print $0",NA"}' example.csv
os,num1,alpha1,code
Unix,10,A,NA
Linux,30,B,NA
Solaris,40,C,NA
Fedora,20,D,NA
Ubuntu,50,E,NA
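Since the question says "a certain string", it may help to pass the string in as a variable instead of hard-coding it; a sketch along the same lines (val is a placeholder name):

awk -v val="NA" 'NR==1{print $0",code"}; NR>1{print $0","val}' example.csv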
I have a text file that contains numerous lines that have partially duplicated strings. I would like to remove lines where a string match occurs twice, such that I am left only with lines with a single match (or no match at all).
An example input:
g1: sample1_out|g2039.t1.faa sample1_out|g334.t1.faa sample1_out|g5678.t1.faa sample2_out|g361.t1.faa sample3_out|g1380.t1.faa sample4_out|g597.t1.faa
g2: sample1_out|g2134.t1.faa sample2_out|g1940.t1.faa sample2_out|g45.t1.faa sample4_out|g1246.t1.faa sample3_out|g2594.t1.faa
g3: sample1_out|g2198.t1.faa sample5_out|g1035.t1.faa sample3_out|g1504.t1.faa sample5_out|g441.t1.faa
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
In this case I would like to remove lines 1, 2, and 3 because sample1 is repeated multiple times on line 1, sample2 appears twice on line 2, and sample5 appears twice on line 3. Line 4 passes because it contains only one instance of each sample.
I am okay with repeating this operation multiple times using different 'match' strings (e.g. sample1_out, sample2_out, etc. in the example above).
Here is one in GNU awk:
$ awk -F"[| ]" '{ # pipe or space is the field reparator
delete a # delete previous hash
for(i=2;i<=NF;i+=2) # iterate every other field, ie right side of space
if($i in a) # if it has been seen already
next # skit this record
else # well, else
a[$i] # hash this entry
print # output if you make it this far
}' file
Output:
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
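Note that delete a on a whole array is a GNU awk extension (hence "GNU awk" above). For other awks, split("", a) is the usual portable way to empty an array; a sketch of the same logic:

awk -F"[| ]" '{
    split("", a)            # portably empty the array
    for(i=2;i<=NF;i+=2)     # sample names sit in the even-numbered fields
        if($i in a) next    # a sample repeated on this line: drop the line
        else a[$i]
    print
}' file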
The following sed command will accomplish what you want:
sed -ne '/.* \(.*\)|.*\1.*/!p' file.txt
It matches lines in which a string appearing before a | occurs again later on the same line; !p then prints only the lines that do not match (-n suppresses the default output).
With grep:
grep -vE '(sample[0-9]).*\1' file
Inspired by Glenn's answer: use -i with sed to make the changes directly in the file.
sed -r '/(sample[0-9]).*\1/d' txt_file
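Combined, that becomes the following (GNU sed; -i edits txt_file in place, so test without it first):

sed -ri '/(sample[0-9]).*\1/d' txt_file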
Been stuck on this for a while. I managed to remove two columns completely, but now I need to remove two of the three space-separated parts inside the first column, without touching the column headings. I've attached a snippet from my csv file.
timestamp;CPU;%usr;%nice;%sys;%iowait;%steal;%irq;%soft;%guest;%idle
2014-09-17 10-20-39 UTC;-1;6.53;0.00;4.02;0.00;0.00;0.00;0.00;0.00;89.45
2014-09-17 10-20-41 UTC;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
2014-09-17 10-20-43 UTC;-1;1.98;0.00;1.98;5.45;0.00;0.50;0.00;0.00;90.10
2014-09-17 10-20-45 UTC;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
2014-09-17 10-20-47 UTC;-1;0.50;0.00;1.50;0.00;0.00;0.00;0.00;0.00;98.00
2014-09-17 10-20-49 UTC;-1;0.50;0.00;1.01;3.02;0.00;0.00;0.00;0.00;95.48
What I want to do is remove the yyyy-mm-dd and the UTC, leaving just 10-20-39 underneath the timestamp column heading. I've tried removing them, but I can't seem to do it without also taking out the headings.
Thanks to anyone who can help me with this.
A perl way:
perl -pe 's/^.+? (.+?) .+?;/$1;/ if $.>1' file
Explanation
The -pe means "print each line after applying the script to it". The script identifies the first three space-separated words on the line and replaces them with the 2nd of the three ($1, since that part of the pattern was captured). This is only done if the current line number ($.) is greater than 1.
An awk way
awk -F';' '(NR>1){sub(/[^ ]* /,"",$1); sub(/ [^ ]*$/,"",$1)}1;' OFS=";" file
Here, we set the input field delimiter to ; and use sub() to remove the 1st and last word from the 1st field.
The following sed command works for you:
sed '1!s/^[^ ]\+ //;1!s/ UTC//'
Explanations:
1! Do not apply to the first line.
s/^[^ ]\+ // Remove the first group of non-space characters at line beginning ("2014-09-17 " in your case).
s/ UTC// Remove the string " UTC".
Assuming the csv file is stored as a.csv, then
sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv
prints the results to standard output, and
sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv > b.csv
saves the result to b.csv.
EDITED: added sample results:
[pengyu#GLaDOS tmp]$ sed '1!s/^[^ ]\+ //;1!s/ UTC//' < a.csv
timestamp;CPU;%usr;%nice;%sys;%iowait;%steal;%irq;%soft;%guest;%idle
10-20-39;-1;6.53;0.00;4.02;0.00;0.00;0.00;0.00;0.00;89.45
10-20-41;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
10-20-43;-1;1.98;0.00;1.98;5.45;0.00;0.50;0.00;0.00;90.10
10-20-45;-1;0.50;0.00;1.51;0.00;0.00;0.00;0.00;0.00;97.99
10-20-47;-1;0.50;0.00;1.50;0.00;0.00;0.00;0.00;0.00;98.00
10-20-49;-1;0.50;0.00;1.01;3.02;0.00;0.00;0.00;0.00;95.48
I have a file "acl.txt":
192.168.0.1
192.168.4.5
#start_exceptions
192.168.3.34
192.168.6.78
#end_exceptions
192.168.5.55
and another file "exceptions"
192.168.88.88
192.168.76.6
I need to replace everything between #start_exceptions and #end_exceptions with the content of the exceptions file. I have tried many solutions from this forum but none of them works.
EDITED:
OK, if you want to retain the #start and #end lines, I will revert to awk:
awk '
BEGIN {p=1}
/^#start/ {print;system("cat exceptions");p=0}
/^#end/ {p=1}
p' acl.txt
Thanks to @fedorqui for tweaks in the comments below.
Output:
192.168.0.1
192.168.4.5
#start_exceptions
192.168.88.88
192.168.76.6
#end_exceptions
192.168.5.55
p is a flag that says whether or not to print lines. It starts out as 1, so all lines are printed until I find a line starting with #start. Then I cat the contents of the exceptions file and stop printing lines till I find a line starting with #end, at which point I set the p flag back to 1 so the remaining lines get printed.
If you want output to a file, add "> newfile" to the very end of the command like this:
awk '
BEGIN {p=1}
/^#start/ {print;system("cat exceptions");p=0}
/^#end/ {p=1}
p' acl.txt > newfile
YET ANOTHER VERSION IF YOU REALLY WANT TO USE SED
If you really, really want to do it with sed, you can use nested addresses: first select the lines between #start_exceptions and #end_exceptions, then, within that range, handle the #start_exceptions line specially and delete every line other than the #end_exceptions line:
sed '
/^#start/,/^#end/{
/^#start/{
n
r exceptions
}
/^#end/!d
}
' acl.txt
Output:
192.168.0.1
192.168.4.5
#start_exceptions
192.168.88.88
192.168.76.6
#end_exceptions
192.168.5.55
ORIGINAL ANSWER
I think this will work:
sed -e '/^#end/r exceptions' -e '/^#start/,/^#end/d' acl.txt
When it finds /^#end/ it reads in the exceptions file, and it also deletes everything from /^#start/ through /^#end/, so the marker lines themselves are removed. I have left the matching slightly "loose" for clarity of expressing the technique.
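The text queued by r is still written even though the #end line itself is deleted, so (assuming the sample files above) the output should be:

192.168.0.1
192.168.4.5
192.168.88.88
192.168.76.6
192.168.5.55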
You can use the following, based on Replace string with contents of a file using sed:
$ sed $'/end/ {r exceptions\n} ; /start/,/end/ {d}' acl.txt
192.168.0.1
192.168.4.5
192.168.88.88
192.168.76.6
192.168.5.55
Explanation
sed $'one_thing; another_thing' acl.txt performs the two actions.
/end/ {r exceptions\n} if the line contains end, then read the file exceptions and append it.
/start/,/end/ {d} from a line containing start to a line containing end, delete all the lines.
I had a problem with Mark Setchell's solution in MINGW: the caret was not matching the beginning of the line. Indeed, is the detection of the separator dependent on it being at the beginning of the line?
I came up with this awk alternative...
$ awk -v data="$(<exceptions)" '
BEGIN {p=1}
/#start_exceptions/ {print; print data;p=0}
/#end_exceptions/ {p=1}
p
' acl.txt
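Note that $(<exceptions) is a bash/ksh expansion, so the command above depends on the shell. A shell-independent variant can read the file from inside awk with getline; a sketch, using the same file names as the question:

awk '
/#start_exceptions/ {        # print the marker, then the replacement block
    print
    while ((getline line < "exceptions") > 0) print line
    close("exceptions")
    skip = 1
    next
}
/#end_exceptions/ { skip = 0 }   # resume printing at the end marker
!skip
' acl.txt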