I have written the following command
#!/bin/bash
awk -v value=$newvalue -v row=$rownum -v col=1 'BEGIN{FS=OFS=","} NR==row {$col=value}1' "${file}".csv >> temp.csv && mv temp.csv "${file}".csv
Sample Input of file.csv
Header,1
Field1,Field2,Field3
1,ABC,4567
2,XYZ,7890
Assuiming $newvalue=3 ,$rownum=4 and col=1, then the above code will replace:
Required Output
Header,1
Field1,Field2,Field3
1,ABC,4567
3,XYZ,7890
So if I know the row and column, is it possible to replace the said value using grep, sed?
Edit1: Field3 will always have a unique value for their respective rows. ( in case that info helps anyway)
Assuming your CSV file is as simple as what you show (no commas in quoted fields), and your newvalue does not contain characters that sed would interpret in a special way (e.g. ampersands, slashes or backslashes), the following should work with just sed (tested with GNU sed):
sed -Ei "$rownum s/[^,]*/$newvalue/$col" file.csv
Demo:
$ cat file.csv
Header,1
Field1,Field2,Field3
1,ABC,4567
3,XYZ,7890
$ rownum=3
$ col=2
$ newvalue="NEW"
$ sed -Ei "$rownum s/[^,]*/$newvalue/$col" file.csv
$ cat file.csv
Header,1
Field1,Field2,Field3
1,NEW,4567
3,XYZ,7890
Explanations: $rownum is used as the address (here the line number) where to apply the following command. s is the sed substitute command. [^,]* is the regular expression to search for and replace: the longest possible string not containing a comma. $newvalue is the replacement string. $col is the occurrence to replace.
If newvalue may contain ampersands, slashes or backslashes we must sanitize it first:
sanitizednewvalue=$(sed -E 's/([/\&])/\\\1/g' <<< "$newvalue")
sed -Ei "$rownum s/[^,]*/$sanitizednewvalue/$col" file.csv
Demo:
$ newvalue='NEW&\/&NEW'
$ sanitizednewvalue=$(sed -E 's/([/\&])/\\\1/g' <<< "$newvalue")
$ echo "$sanitizednewvalue"
NEW\&\\\/\&NEW
$ sed -Ei "$rownum s/[^,]*/$sanitizednewvalue/$col" file.csv
$ cat file.csv
Header,1
Field1,Field2,Field3
1,NEW&\/&NEW,4567
3,XYZ,7890
With sed, how about:
#!/bin/bash
newvalue=3
rownum=4
col=1
sed -i -E "${rownum} s/(([^,]+,){$((col-1))})[^,]+/\\1${newvalue}/" file.csv
Result of file.csv
Header,1
Field1,Field2,Field3
1,ABC,4567
3,XYZ,7890
${rownum} matches the line number.
(([^,]+,){n}) matches the n-time repetition of the group of
non-comma characters followed by a comma. Then it should be the substring
before the target (to be substituted) column by assigning n to
col - 1.
Let's Try to Implement sed command
Let us consider a sample CSV file with the following content:
$ cat file
Solaris,25,11
Ubuntu,31,2
Fedora,21,3
LinuxMint,45,4
RedHat,12,5
To remove the 1st field or column :
$ sed 's/[^,]*,//' file
25,11
31,2
21,3
45,4
12,5
This regular expression searches for a sequence of non-comma([^,]*) characters and deletes them which results in the 1st field getting removed.
To print only the last field, OR remove all fields except the last field:
$ sed 's/.*,//' file
11
2
3
4
5
This regex removes everything till the last comma(.*,) which results in deleting all the fields except the last field.
To print only the 1st field:
$ sed 's/,.*//' file
Solaris
Ubuntu
Fedora
LinuxMint
RedHat
This regex(,.*) removes the characters starting from the 1st comma till the end resulting in deleting all the fields except the last field.
To delete the 2nd field:
$ sed 's/,[^,]*,/,/' file
Solaris,11
Ubuntu,2
Fedora,3
LinuxMint,4
RedHat,5
The regex (,[^,]*,) searches for a comma and sequence of characters followed by a comma which results in matching the 2nd column, and replaces this pattern matched with just a comma, ultimately ending in deleting the 2nd column.
Note: To delete the fields in the middle gets more tougher in sed since every field has to be matched literally.
To print only the 2nd field:
$ sed 's/[^,]*,\([^,]*\).*/\1/' file
25
31
21
45
12
The regex matches the first field, second field and the rest, however groups the 2nd field alone. The whole line is now replaced with the 2nd field(\1), hence only the 2nd field gets displayed.
Print only lines in which the last column is a single digit number:
$ sed -n '/.*,[0-9]$/p' file
Ubuntu,31,2
Fedora,21,3
LinuxMint,45,4
RedHat,12,5
The regex (,[0-9]$) checks for a single digit in the last field and the p command prints the line which matches this condition.
To number all lines in the file:
$ sed = file | sed 'N;s/\n/ /'
1 Solaris,25,11
2 Ubuntu,31,2
3 Fedora,21,3
4 LinuxMint,45,4
5 RedHat,12,5
This is simulation of cat -n command. awk does it easily using the special variable NR. The '=' command of sed gives the line number of every line followed by the line itself. The sed output is piped to another sed command to join every 2 lines.
Replace the last field by 99 if the 1st field is 'Ubuntu':
$ sed 's/\(Ubuntu\)\(,.*,\).*/\1\299/' file
Solaris,25,11
Ubuntu,31,99
Fedora,21,3
LinuxMint,45,4
RedHat,12,5
This regex matches 'Ubuntu' and till the end except the last column and groups each of them as well. In the replacement part, the 1st and 2nd group along with the new number 99 is substituted.
Delete the 2nd field if the 1st field is 'RedHat':
$ sed 's/\(RedHat,\)[^,]*\(.*\)/\1\2/' file
Solaris,25,11
Ubuntu,31,2
Fedora,21,3
LinuxMint,45,4
RedHat,,5
The 1st field 'RedHat', the 2nd field and the remaining fields are grouped, and the replacement is done with only 1st and the last group , resuting in getting the 2nd field deleted.
To insert a new column at the end(last column) :
$ sed 's/.*/&,A/' file
Solaris,25,11,A
Ubuntu,31,2,A
Fedora,21,3,A
LinuxMint,45,4,A
RedHat,12,5,A
The regex (.*) matches the entire line and replacing it with the line itself (&) and the new field.
To insert a new column in the beginning(1st column):
$ sed 's/.*/A,&/' file
A,Solaris,25,11
A,Ubuntu,31,2
A,Fedora,21,3
A,LinuxMint,45,4
A,RedHat,12,5
Same as last example, just the line matched is followed by the new column
I hope this will help. Let me know if you need to use Awk or any other command.
Thank you
I have a file like this
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq"rst"|uv||xyz
abc|def||ghi|jklm|"nopq"r"st"|uv||xyz
The 6th Column could be double quoted. I want to replace all the occurances of double quotes in this field with a backslash-double quote (\")
I wish my output to look like
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
I have tried combinations of below, but ending short each time
sed -i 's/\"/\\\"/2' file.txt (this replaces only 2nd occurrence)
sed -i 's/\"/\\\"/2g' file.txt (this replaces only 2nd occurrence and all rest also)
My file will be having millions of rows; so I may need a sed or awk command only.
Please help.
You may use this awk solution in any version of awk:
awk 'BEGIN {FS=OFS="|"} {
c1 = substr($6, 1, 1)
c2 = substr($6, length($6), 1)
s = substr($6, 2, length($6)-2)
gsub(/"/, "\\\"", s)
$6 = c1 s c2
} 1' file
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
If this isn't all you need then edit your question to provide more truly representative sample input/output including cases that this doesn't work for:
$ sed 's/"/\\"/g; s/|\\"/|"/g; s/\\"|/"|/g' file
abc|def||ghi|jklm||uv||xyz
abc|def||ghi|jklm|nopqrst|uv||xyz
abc|def||ghi|jklm|nopq\"rst|uv||xyz
abc|def||ghi|jklm|"nopqrst"|uv||xyz
abc|def||ghi|jklm|"nopq\"rst"|uv||xyz
abc|def||ghi|jklm|"nopq\"r\"st"|uv||xyz
The above will work in any sed.
This might work for you (GNU sed):
sed -E 's/[^|]*/\n&\n/6 # isolate the 6th field
h # make a copy
s/"/\\"/g # replace " by \"
s/\\(")\n|\n\\(")/\1\n\2/g # repair start and end "s
H # append amended line to copy
g # get copies to current line
s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file # swap fields
Surround the 6th field by newlines and make a copy in the hold space.
Replace all "'s by \"'s and remove the \'s at the start and end of the field if the field begins and ends in "'s
Append the amended line to the copy and replace the current line by the doubled line.
Using pattern matching replace copied line 6th field by the amended one.
I have a tab separated text file. In column 1 and 2 there are family and individual ids that start with a character followed by number as follow:
HG1005 HG1005
HG1006 HG1006
HG1007 HG1007
NA1008 NA1008
NA1009 NA1009
I would like to replace NA with HG in both the columns. I am very new to linux and tried the following code and some others:
awk '{sub("NA","HG",$2)';print}' input file > output file
Any help is highly appreciated.
Converting my comment to answer now, use gsub in spite of sub here. Because it will globally substitute NA to HG here.
awk 'BEGIN{FS=OFS="\t"} {gsub("NA","HG");print}' inputfile > outputfile
OR use following in case you have several fields and you want to perform substitution only in 1st and 2nd fields.
awk 'BEGIN{FS=OFS="\t"} {sub("NA","HG",$1);sub("NA","HG",$2);print}' inputfile > outputfile
Change sub to gsub in 2nd code in case multiple occurrences of NA needs to be changed within field itself.
The $2 in your call to sub only replaces the first occurrence of NA in the second field.
Note that while sed is more typical for such scenarios:
sed 's/NA/HG/g' inputfile > outputfile
you can still use awk:
awk '{gsub("NA","HG")}1' inputfile > outputfile
See the online demo.
Since there is no input variable in gsub (that performs multiple search and replaces) the default $0 is used, i.e. the whole record, the current line, and the code above is equal to awk '{gsub("NA","HG",$0)}1' inputfile > outputfile.
The 1 at the end triggers printing the current record, it is a shorter variant of print.
Notice /^NA/ position at the beginning of field:
awk '{for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file
HG1005 HG1005
HG1006 HG1006
HG1007 HG1007
HG1008 HG1008
HG1009 HG1009
and save it:
awk '{for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file > outputfile
If you have a tab as separator:
awk 'BEGIN{FS=OFS="\t"} {for(i=1;i<=NF;i++)if($i ~ /^NA/) sub(/^NA/,"HG",$(i))} 1' file > outputfile
I have a text file that contains numerous lines that have partially duplicated strings. I would like to remove lines where a string match occurs twice, such that I am left only with lines with a single match (or no match at all).
An example output:
g1: sample1_out|g2039.t1.faa sample1_out|g334.t1.faa sample1_out|g5678.t1.faa sample2_out|g361.t1.faa sample3_out|g1380.t1.faa sample4_out|g597.t1.faa
g2: sample1_out|g2134.t1.faa sample2_out|g1940.t1.faa sample2_out|g45.t1.faa sample4_out|g1246.t1.faa sample3_out|g2594.t1.faa
g3: sample1_out|g2198.t1.faa sample5_out|g1035.t1.faa sample3_out|g1504.t1.faa sample5_out|g441.t1.faa
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
In this case I would like to remove lines 1, 2, and 3 because sample1 is repeated multiple times on line 1, sample 2 is twice on line 2, and sample 5 is repeated twice on line 3. Line 4 would pass because it contains only one instance of each sample.
I am okay repeating this operation multiple times using different 'match' strings (e.g. sample1_out , sample2_out etc in the example above).
Here is one in GNU awk:
$ awk -F"[| ]" '{ # pipe or space is the field reparator
delete a # delete previous hash
for(i=2;i<=NF;i+=2) # iterate every other field, ie right side of space
if($i in a) # if it has been seen already
next # skit this record
else # well, else
a[$i] # hash this entry
print # output if you make it this far
}' file
Output:
g4: sample1_out|g2357.t1.faa sample2_out|g686.t1.faa sample3_out|g1251.t1.faa sample4_out|g2021.t1.faa
The following sed command will accomplish what you want.
sed -ne '/.* \(.*\)|.*\1.*/!p' file.txt
grep: grep -vE '(sample[0-9]).*\1' file
Inspiring from Glenn's answer: use -i with sed to directly do changes in the file.
sed -r '/(sample[0-9]).*\1/d' txt_file
I use the following sed command in order to replace string in CSV line
( the condition to replace the string is to match the number in the beginning of the CSV file )
SERIAL_NUM=1
sed "/$SERIAL_NUM/ s//OK/g" file.csv
the problem is that I want to match only the number that start in the beginning of the line ,
but sed match other lines that have this number
example:
in this example I want to replace the word - STATUS to OK but only in line that start with 1 ( before the "," separator )
so I do this
SERIAL_NUM=1
more file.csv
1,14556,43634,266,242,def,45,STATUS
2,4345,1,43,57,24,657,SD,STATUS
3,1,WQ,435,676,90,3,44f,STATUS
sed -i "/$SERIAL_NUM/ s/STATUS/OK/g" file.csv
more file.csv
1,14556,43634,266,242,def,45,OK
2,4345,1,43,57,24,657,SD,OK
3,1,WQ,435,676,90,3,44f,OK
but sed also replace the STATUS to OK also in line 2 and line 3 ( because those lines have the number 1 )
please advice how to change the sed syntax in order to match only the number that start the line before the "," separator
remark - solution can be also with perl line liner or awk ,
You can use anchor ^ to make sure $SERIAL_NUM only matches at start and use , after that to make sure there is a comma followed by this number:
sed "/^$SERIAL_NUM,/s/STATUS/OK/g" file.csv
Since this answer was ranked fifth in the Stackoverflow perl report but had no perl content, I thought it would be useful to add the following - instead of removing the perl tag :-)
#!/usr/bin/env perl
use strict;
use warnings;
while(<DATA>){
s/STATUS/OK/g if /^1\,/;
print ;
}
__DATA__
1,14556,43634,266,242,def,45,STATUS
2,4345,1,43,57,24,657,SD,STATUS
3,1,WQ,435,676,90,3,44f,STATUS
or as one line:
perl -ne 's/STATUS/OK/g if /^1\,/;' file.csv