I have tab delimited files as shown below:
CNV_chr1_12623251_12632176 8925 3 RR123 XX
CNV_chr1_13398757_13402091 3334 4 RR123 YY
CNV_chr1_13398757_13402091 3334 4 RR224 YY
CNV_chr1_14001365_14004064 2699 1 RR123 YX
CNV_chr1_14001365_14004064 2699 1 RR224 YX
Columns $1 and $2 stay identical across such duplicate rows. In this case I need to collapse the duplicates into a single row, joining the 4th-column values with commas, and add an additional $5 holding the number of comma-separated strings in $4. Sample output shown below:
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224 2 YY
CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
Any working solution would be helpful.
Try this:
awk '($1 in ar){br[$1]=br[$1]","$4; next}          # key seen before: just collect its $4
{br[$1]=$4; $4="REPLACE_ME"; ar[$1]=$0}            # new key: store the line with a placeholder where $4 was
END{for(key in ar){c=split(br[key],s,",")          # c = number of collected $4 values
gsub("REPLACE_ME", br[key] FS c, ar[key])          # replace the placeholder with the joined values and the count
print ar[key]}}' test.txt
The output:
CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224 2 YY
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
For tab-delimited input just add -F"\t" to awk:
awk -F"\t" '($1 in ar){ar[$1]=ar[$1]; br[$1]=br[$1]","$4; next;}
{br[$1]=$4; $4="REPLACE_ME"; ar[$1]=$0}
END{for(key in ar){c=split(br[key],s,",")
gsub("REPLACE_ME", br[key] FS c, ar[key])
print ar[key]}}' test.txt
and get:
CNV_chr1_14001365_14004064 2699 1 RR123,RR224 2 YX
CNV_chr1_13398757_13402091 3334 4 RR123,RR224 2 YY
CNV_chr1_12623251_12632176 8925 3 RR123 1 XX
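Note that for(key in ar) iterates in an unspecified order, which is why the output above is not in input order. If input order matters, here is a minimal sketch along the same lines (still assuming tab-delimited test.txt; this is not part of the answer above):
awk -F'\t' -v OFS='\t' '
!($1 in br){ order[++n]=$1; pre[$1]=$1 OFS $2 OFS $3; post[$1]=$5 }   # first time a key is seen: remember its position and fixed fields
{ br[$1]=(br[$1]=="" ? "" : br[$1] ",") $4 }                          # collect the 4th-column values
END{
  for(i=1;i<=n;i++){
    key=order[i]
    c=split(br[key],parts,",")                                        # c = how many values were collected
    print pre[key], br[key], c, post[key]
  }
}' test.txt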
Related
I have this file (space delimited) :
bc1 no 12
bc1 no 15
bc1 yes 4
bc2 no 8
bc3 yes 14
bc3 yes 12
bc4 no 2
I would like to get this output :
bc1 3 no;no;yes 31
bc2 1 no 8
bc3 2 yes;yes 26
bc4 1 no 2
1st column : one occurrence of the first column of the input file
2nd : the number of times it occurs in the input file
3rd : the corresponding values of the 2nd input column, joined into one field with ";" as delimiter
4th : the sum of the last column
I can do what I want with the "no/yes" column :
awk -F' ' 'NF>2{a[$1] = a[$1]";"$2}END{for(i in a){print i" "a[i]}}' test.txt | sort -k1,1n
With your shown samples, please try the following awk code. Since $1 (the first column) is already sorted/grouped, there is no need to sort it, hence this single-pass solution.
awk '
prev!=$1 && prev{                      # 1st column changed: print the finished group
  print prev OFS count,value,sum
  count=sum=0
  prev=value=""
}
{
  prev=$1
  value=(value?value ";":"") $2        # join the 2nd column values with ";"
  count++
  sum+=$NF                             # running sum of the last column
}
END{
  if(prev){                            # flush the last group
    print prev OFS count,value,sum
  }
}
' Input_file
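If the first column were not already grouped, a variant of the same idea (just a sketch, not part of the answer above) could collect everything into arrays and print in first-seen order:
awk '
{
  if(!($1 in count)) order[++n]=$1                        # remember the first-seen order of each key
  count[$1]++
  series[$1]=(series[$1]=="" ? "" : series[$1] ";") $2    # join the 2nd column with ";"
  sum[$1]+=$NF                                            # sum of the last column
}
END{
  for(i=1;i<=n;i++){
    k=order[i]
    print k, count[k], series[k], sum[k]
  }
}
' Input_file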
Here's a solution with newer versions of datamash (thanks glenn jackman for the tips about count and --collapse-delimiter):
datamash -t' ' -g1 --collapse-delimiter=';' count 2 collapse 2 sum 3 <ip.txt
With older versions (mine is 1.4) and awk:
$ datamash -t' ' -g1 count 2 collapse 2 sum 3 <ip.txt
bc1 3 no,no,yes 31
bc2 1 no 8
bc3 2 yes,yes 26
bc4 1 no 2
$ <ip.txt datamash -t' ' -g1 count 2 collapse 2 sum 3 | awk '{gsub(/,/, ";", $3)} 1'
bc1 3 no;no;yes 31
bc2 1 no 8
bc3 2 yes;yes 26
bc4 1 no 2
-t sets space as the field separator. collapse 2 joins the values of column 2, using column 1 as the grouping key. sum 3 totals the numbers in column 3, and count 2 counts how many rows were collapsed in each group.
Then awk is used to change the , in the third column to ;
One alternative awk idea:
awk '
function print_row() {
if (series)
print key,c,series,sum
c=sum=0
series=sep=""
}
{ if ($1 != key) # if 1st column has changed then print previous data set
print_row()
key=$1
c++
series=series sep $2
sep=";"
sum+=$3
}
END { print_row() } # flush last data set to stdout
' input
This generates:
bc1 3 no;no;yes 31
bc2 1 no 8
bc3 2 yes;yes 26
bc4 1 no 2
I have a file like this:
Record 1
x1 5
x2 0 7 0'BCD
x31 18
x45 45
x67 4
Record 2
x1 9
x2 0 6 0'BCD
x3 8
x35 6
x45 7
x88 3
Record 3
x1 5
x2 0 5 0'BCD
x4 18
x35 16
x98 3
Record 4
x1 5
x2 0 4 0'BCD
x4 18
x35 16
x45 77
x98 3
For each record, I am interested in the value following x45 (if x45 exists in the record; if it does not, skip the record entirely). When x45 is found, go back up and also get the value of x2.
So the desired output will be (note that Record 3 has no x45, so it is skipped):
45 , 0 7 0'BCD
7 , 0 6 0'BCD
77 , 0 4 0'BCD
I can guarantee that if x45 exists then x2 will exist as well.
How can I do this with awk/sed/grep?
Another awk solution:
awk -v search="x45" -v before="x2" '
$1==before{ p=""; for(i=2; i<=NF; i++)p=((p=="") ? "" : p " ") $i}
$1==search{ print $2 " , " p }
' file
45 , 0 7 0'BCD
7 , 0 6 0'BCD
77 , 0 4 0'BCD
Or with some left-aligned formatting:
$ awk -v search="x45" -v before="x2" '
$1==before{ p=""; for(i=2; i<=NF; i++)p=((p=="") ? "" : p " ") $i}
$1==search{ printf "%-3s, %s" ORS, $2, p }
' file
45 , 0 7 0'BCD
7 , 0 6 0'BCD
77 , 0 4 0'BCD
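The field loop in the $1==before block just rebuilds everything after the first field; if you prefer, sub() can strip the label instead (a sketch with the same label assumptions):
awk -v search="x45" -v before="x2" '
$1==before { p=$0; sub(/^[^ \t]+[ \t]+/, "", p) }   # p = the x2 line without its leading label
$1==search { print $2 " , " p }
' file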
$ awk -F'\n' -v RS= '{print $NF, $2}' file
Address: street#3 Job: Dentist
Address: street#4 Job: Engineer
Address: street#5 Job: Doctor
Making the label an input parameter:
$ awk -F'\n' -v RS= -v term='Address' '{for(i=1;i<=NF;i++)
if($i~/^Job/) job=$i
else if($i~term) {a=$i; break}
if(a) print a, job;
a=job=""}' file
It shouldn't print anything if the search term is not found, but your example doesn't cover that case. Note that, as specified, it won't work if the search term is "Name", since "Job" appears after it.
$ sed -nE '/^x2/h;/^x45/{H;x;s|x2(.*)\nx45(.*)|\2 ,\1|p}' file
 45 , 0 7 0'BCD
 7 , 0 6 0'BCD
 77 , 0 4 0'BCD
You can pipe that to column to get a visually better output:
$ sed -nE '/^x2/h;/^x45/{H;x;s|x2(.*)\nx45(.*)|\2 ,\1|p}' file | column -t
45 , 0 7 0'BCD
7 , 0 6 0'BCD
77 , 0 4 0'BCD
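The hold-space logic is dense, so here is the same sed program spread over several lines with comments (same command as above, GNU sed):
sed -nE '
  # overwrite the hold space with the most recent x2 line
  /^x2/h
  # on an x45 line: append it to the hold space, swap the pair into the
  # pattern space, and rearrange it to "x45-value , x2-value"
  /^x45/{
    H
    x
    s|x2(.*)\nx45(.*)|\2 ,\1|p
  }
' file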
I want to compare the first 4 columns of file1 and file2. I want to print all lines from file1 + the lines from file2 that are not in file1.
File1:
2435 2 2 7 specification 9-8-3-0
57234 1 6 4 description 0-0 55211
32423 2 44 3 description 0-0 24242
File2:
2435 2 2 7 specification
7624 2 2 1 namecomplete
57234 1 6 4 description
28748 34 5 21 gateway
32423 2 44 3 description
832758 3 6 namecomplete
output:
2435 2 2 7 specification 9-8-3-0
57234 1 6 4 description 0-0 55211
32423 2 44 3 description 0-0 24242
7624 2 2 1 namecomplete
28748 34 5 21 gateway
832758 3 6 namecomplete
I don't understand how to print things that don't match.
You can do it with an awk script like this:
script.awk
FNR == NR { mem[ $1 $2 $3 $4 $5 ] = 1;
print
next
}
{ key = $1 $2 $3 $4 $5
if( ! ( key in mem) ) print
}
Run it like this: awk -f script.awk file1 file2.
The first block memorizes the first 5 fields as a key, prints the whole line, and moves on to the next input line; because of FNR == NR it is applied only to lines from the first file.
The second block is applied only to lines from the second file. It checks whether the key is in mem; if it is not, the line was not in file1 and it is printed.
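One caveat (not from the original answer): concatenating the fields directly, as in mem[$1 $2 $3 $4 $5], can in principle let different rows collide (e.g. fields "ab" "c" vs "a" "bc"). awk's comma subscripts insert the built-in SUBSEP separator and avoid that; a sketch of the same script with that change:
FNR == NR { mem[$1,$2,$3,$4,$5] = 1      # the commas insert SUBSEP between the fields
            print
            next
          }
{ if( ! (($1,$2,$3,$4,$5) in mem) ) print }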
I have:
file1.txt file2.txt file3.txt
8 2 2
1 2 1
8 1 0
3 3
5 3
3
4
I want to paste all three of these columns into ofile.txt.
I tried with
paste file1.txt file2.txt file3.txt > ofile.txt
Result I got in ofile.txt:
ofile.txt:
8 2 2
1 2 1
8 1 0
3 3
5 3
3
4
What it should look like instead:
ofile.txt
8 2 2
1 2 1
8 1 0
3 3
5 3
3
4
You can try this paste command in bash using process substitution:
paste <(sed 's/^[[:blank:]]*//' file1.txt) file2.txt file3.txt
8 2 2
1 2 1
8 1 0
3 3
5 3
3
4
The sed command is used to remove the leading whitespace from file1.txt.
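If more than one of the files may carry leading blanks, the same cleanup can be applied to each of them (a sketch using the same file names):
paste <(sed 's/^[[:blank:]]*//' file1.txt) \
      <(sed 's/^[[:blank:]]*//' file2.txt) \
      <(sed 's/^[[:blank:]]*//' file3.txt)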
I can reproduce your output when I make the input files with tabs.
paste also uses tabs between the columns and lines things up accordingly.
You can see what happens when I replace the tabs with -:
# more x* | tr '\t' '-'
::::::::::::::
x1
::::::::::::::
-1a
-1b
-1c
-1d
::::::::::::::
x2
::::::::::::::
-2a
-2b
::::::::::::::
x3
::::::::::::::
-3a
-3b
-3c
-3d
-3e
-3f
-3g
# paste x? | tr '\t' '-'
-1a--2a--3a
-1b--2b--3b
-1c---3c
-1d---3d
---3e
---3f
---3g
Think about how you want it. If you want the columns to line up, you have to pad the files that have fewer lines with extra tabs, or manipulate the result: turn 3 consecutive tabs into 4, and 4 tabs at the beginning of a line into 5.
sed -e 's/\t\t\t/\t\t\t\t/' -e 's/^\t\t\t\t/\t\t\t\t\t/'
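If the goal is only that the three columns stay visually aligned regardless of how many lines each file has, one possible sketch (assuming the same file1.txt, file2.txt, file3.txt) is to keep paste's tab delimiters and reprint with fixed-width fields, so empty cells stay empty instead of shifting:
paste file1.txt file2.txt file3.txt |
awk -F'\t' '{ printf "%-4s%-4s%-4s\n", $1, $2, $3 }' > ofile.txt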
I have a csv file which has many columns and rows of this type.
foo.csv
1 1 x1
1 1 x2
1 1 y1
1 1 y2
. . .
What command should I use or what script should I create in order to get it to look like this:
foo.csv
1 1 x1 1 1 y1
1 1 x2 1 1 y2
. . . . . .
In other words, move the last rows of foo.csv, starting from 1 1 y1, up next to the first rows as additional columns.
Thanks in advance!
Paul
$ cat /tmp/1
abc 1
def 2
ghi 3
pqr 4
uvw 5
xyz 6
$ paste <(head -3 /tmp/1) <(tail -3 /tmp/1)
abc 1 pqr 4
def 2 uvw 5
ghi 3 xyz 6
Here you take three lines for the first column and three lines for the second.
If you don't know in advance how many lines to take, you can find the split point first.
$ n=$(cat -n /tmp/1 | grep pqr | cut -f1)
$ paste <(head -$((n-1)) /tmp/1) <(sed 1,$((n-1))d /tmp/1)
abc 1 pqr 4
def 2 uvw 5
ghi 3 xyz 6
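As a side note (not in the original answer), grep -n can report the line number directly, and tail -n +N does the same job as the sed deletion:
n=$(grep -n 'pqr' /tmp/1 | head -n1 | cut -d: -f1)
paste <(head -$((n-1)) /tmp/1) <(tail -n +"$n" /tmp/1)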
Adjust the awk record selector as needed:
paste -d " " <(awk '$3~"x"' tests.txt) <(awk '$3~"y"' tests.txt)