How to print the file name along with the data that comes from it? - linux

I have the below-mentioned files in path 1; their names are listed in the file path_1:
fb1.tril.cap
fb2.tril.cap
fb3.tril.cap
For example, the data in file fb1.tril.cap looks like this:
AT99565 150 500 (DEST 81)
AT99565 101 501 (DEST 883)
AT99565 152 502 (419)
The data in file fb2.tril.cap looks like this:
AT99565 103 1503 (DEST 165)
AT99565 104 154 (DEST 199)
And the data in file fb3.tril.cap looks like this:
RT61446 80 863 (DEST 968)
RT20447 32 39 (DEST 570)
RT51224 73 74 (592)
I have written the code below to print my required fields:
while read file_name
do
    cat ${file_name} | awk -F' ' '$4 == "(DEST" { print $1, $2, $3, $5 }' |
        awk -F')' '{ print $1, $2, $3, $4 }' | uniq >> output.csv
done < path_1
I'm getting the output below:
AT99565 150 500 81
AT99565 101 501 883
AT99565 103 1503 165
AT99565 104 154 199
RT61446 80 863 968
RT20447 32 39 570
But I want to print the file name as well, along with the data, so that each line shows which file it came from:
AT99565 150 500 81 fb1.tril.cap
AT99565 101 501 883 fb1.tril.cap
AT99565 103 1503 165 fb2.tril.cap
AT99565 104 154 199 fb2.tril.cap
RT61446 80 863 968 fb3.tril.cap
RT20447 32 39 570 fb3.tril.cap
Can anyone help me complete this by printing the file name along with the data? Thanks in advance.

First, I am not able to test this solution, but the code below should work for you. Note that a shell variable such as $file_name does not expand inside a single-quoted awk program, so it has to be passed in with awk's -v option:
while read file_name
do
    awk -F' ' '$4 == "(DEST" { print $1, $2, $3, $5 }' "${file_name}" |
        awk -F')' -v fname="${file_name}" '{ print $1, fname }' |
        uniq >> output.csv
done < path_1
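As an alternative sketch (untested against your exact data, but using only standard awk features): awk's built-in FILENAME variable already holds the name of the current input file, so the second awk and the variable passing can be dropped entirely:
while read file_name
do
    # FILENAME is awk's built-in name of the file currently being read
    awk '$4 == "(DEST" { sub(/\)$/, "", $5); print $1, $2, $3, $5, FILENAME }' "${file_name}"
done < path_1 | uniq >> output.csv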

A sed one-liner:
sed -n '/(DEST/{s/[()]\|DEST //g;F;p}' fb*.tril.cap | sed -n 'h;n;G;s/\n/ /gp'
How it works:
/(DEST/ selects only the lines that contain (DEST; lines like (419) are skipped, matching the desired output.
s/[()]\|DEST //g: Instead of parsing (DEST and such, just remove those tokens. What's left after that are the four desired items.
Then F, GNU sed's file name command, prints the name of the current input file.
Since F prints the name immediately, on its own line, a 2nd sed invocation is needed to juggle the output a bit: it holds the file name (h), reads the following data line (n), appends the held name (G), and joins the two with a space.
If the output spacing is uneven, add tr -s to squeeze runs of spaces into one:
sed -n '/(DEST/{s/[()]\|DEST //g;F;p}' fb*.tril.cap | sed -n 'h;n;G;s/\n/ /gp' |
tr -s ' '

Using a Perl one-liner ($ARGV holds the name of the file currently being read):
> ls -l fb*tril*cap
-rw-r--r-- 1 aaaaa bbbbb 77 Dec 6 09:20 fb1.tril.cap
-rw-r--r-- 1 aaaaa bbbbb 58 Dec 6 09:21 fb2.tril.cap
-rw-r--r-- 1 aaaaa bbbbb 74 Dec 6 09:21 fb3.tril.cap
> perl -lane 'print "$_ $ARGV" if $F[3]=~/\(DEST/ and s/\(DEST //g and s/\)//g' fb*tril*cap
AT99565 150 500 81 fb1.tril.cap
AT99565 101 501 883 fb1.tril.cap
AT99565 103 1503 165 fb2.tril.cap
AT99565 104 154 199 fb2.tril.cap
RT61446 80 863 968 fb3.tril.cap
RT20447 32 39 570 fb3.tril.cap
>

Related

Replacing value in column with another value in txt file using awk

I am new to Linux and awk scripting. I have a tab-delimited txt file like the following:
AAA 134 145 Sat 150 167
AAA 156 167 Sat 150 167
AAA 175 187 Sat 150 167
I would like to replace only the value in the last row, second column (175) with the value in the last row, 5th column plus one (150+1), so that my final output looks like this:
AAA 134 145 Sat 150 167
AAA 156 167 Sat 150 167
AAA 151 187 Sat 150 167
I tried awk '$2=$5+1' file.txt, but it changes all the values in the second column, which I don't want. I want to replace only 175 with 150(+1). Kindly guide me.
The difficulty is that, unlike sed, awk does not tell us when we are working on the last row. Here is one work-around:
$ awk 'NR>1{print last} {last=$0} END{$0=last;$2=$5+1;print}' OFS='\t' file.txt
AAA 134 145 Sat 150 167
AAA 156 167 Sat 150 167
AAA 151 187 Sat 150 167
This works by keeping the previous line in the variable last. In more detail:
NR>1{print last}
For every row, except the first, print last.
last=$0
Update the value of last.
END{$0=last; $2=$5+1; print}
When we have reached the end of the file, restore the saved last line, update field 2, and print.
OFS='\t'
Set the field separator on output to a tab.
Alternate method
This approach reads the file twice, first to count the number of lines and the second time to change the last row. Consequently, this is less efficient but it might be easier to understand:
$ awk -v n="$(wc -l <file.txt)" 'NR==n{$2=$5+1} 1' OFS='\t' file.txt
AAA 134 145 Sat 150 167
AAA 156 167 Sat 150 167
AAA 151 187 Sat 150 167
Changing the first row instead
$ awk 'NR==1{$2=$5+1} 1' OFS='\t' file.txt
AAA 151 145 Sat 150 167
AAA 156 167 Sat 150 167
AAA 175 187 Sat 150 167
Changing the first row and the last row
$ awk 'NR==1{$2=$5+1} NR>1{print last} {last=$0} END{$0=last;if(NR>1)$2=$5+1;print}' OFS='\t' file.txt
AAA 151 145 Sat 150 167
AAA 156 167 Sat 150 167
AAA 151 187 Sat 150 167
@John1024's answer is very informative.
awk has a builtin getline function for reading input.
It returns 1 on success, 0 on end of file, and -1 on an error.
awk '{
    # read ahead; when getline returns 0 we are holding the last line
    while ((getline nextline) > 0) {
        print;            # print the current line unchanged
        $0 = nextline     # advance to the line just read (re-splits fields)
    }
    $2 = $5 + 1;          # now on the last line: update field 2
    print
}' OFS='\t' file.txt
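Another sketch along the same lines, assuming tac is available: reverse the file so the last row comes first, apply the NR==1 rule from the accepted answer, and reverse back:
tac file.txt | awk 'NR==1 { $2 = $5 + 1 } 1' OFS='\t' | tac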

awk with zero output

I have four columns and I would like to do this:
INPUT=
429 0 10 0
287 115 89 64
0 629 0 10
542 0 7 0
15 853 0 12
208 587 5 4
435 203 12 0
604 411 27 3
0 232 0 227
471 395 5 5
802 706 15 15
1288 1135 11 23
1063 386 13 2
603 678 7 14
0 760 0 11
awk '{if (($2+$4)/($1+$3)<0.2 || ($1+$3)==0) print $0; else if (($1+$3)/($2+$4)<0.2 || ($2+$4)==0) print $0; else print $0}' INPUT
But I get this error message:
awk: cmd. line:1: (FILENAME=- FNR=3) fatal: division by zero attempted
Even though I have added the condition:
...|| ($1+$3)==0...
Can somebody explain to me what I am doing wrong?
Thank you so much.
PS: print $0 is just for illustration.
Move the "($1+$3) == 0" test to be the first clause of the if statement. Awk evaluates the clauses of || in turn, so with your ordering it still attempts the division first, triggering the divide-by-zero. If the first clause is true, it won't even attempt to evaluate the second one (short-circuit evaluation). So:
awk '{if (($1+$3)==0 || ($2+$4)/($1+$3)<0.2) print $0; else if (($1+$3)/($2+$4)<0.2 || ($2+$4)==0) print $0; else print $0}' INPUT
You're already dividing by zero in your conditional statement ($1+$3 is 0 on the third line of your input, which matches FNR=3 in the error; the ninth line has the same problem). That's where the error comes from. You should change the ordering in your conditional statement: first verify that $1+$3 != 0, and only then use it as a divisor in the next condition.
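For completeness, a sketch that reorders the guards in both branches, so that neither division can be attempted with a zero denominator (keeping print $0 everywhere, as in the question):
awk '{ a = $1 + $3; b = $2 + $4
       if (a == 0 || b/a < 0.2) print $0
       else if (b == 0 || a/b < 0.2) print $0
       else print $0 }' INPUT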

Compare two files and write the unmatched numbers in a new file

I have two files where ifile1.txt is a subset of ifile2.txt.
ifile1.txt   ifile2.txt
2            2
23           23
43           33
51           43
76           50
81           51
100          72
             76
             81
             89
             100
Desired output (ofile.txt):
33
50
72
89
I tried
diff ifile1.txt ifile2.txt > ofile.txt
but it gives the output in a different format.
Since your files are sorted, you can use the comm command for this:
comm -1 -3 ifile1.txt ifile2.txt > ofile.txt
-1 means omit the lines unique to the first file, and -3 means omit the lines that are in both files, so this shows just the lines that are unique to the second file. Note that comm expects its input in lexicographic sort order; files sorted numerically, like these (100 comes after 89), can trip it up, so it is safest to sort both files first.
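A sketch of that, assuming a shell with process substitution (e.g. bash):
comm -13 <(sort ifile1.txt) <(sort ifile2.txt) > ofile.txt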
This will mostly do the job, though diff's hunk markers (lines like 3a3) leave blank lines in the output:
diff file1 file2 |awk '{print $2}'
You could try:
diff file1 file2 | awk '{print $2}' | grep -v '^$' > output.file
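Alternatively, a short awk sketch that doesn't require sorted input: record every line of ifile1.txt as a key, then print the lines of ifile2.txt that were never seen:
awk 'FNR == NR { seen[$0]; next } !($0 in seen)' ifile1.txt ifile2.txt > ofile.txt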

Problems combining awk scripts

I am trying to use awk to parse a tab delimited table -- there are several duplicate entries in the first column, and I need to remove the duplicate rows that have a smaller total sum of the other 4 columns in the table. I can remove the first or second row easily, and sum the columns, but I'm having trouble combining the two. For my purposes there will never be more than 2 duplicates.
Example file: http://pastebin.com/u2GBnm2D
Desired output in this case would be to remove the rows:
lmo0330 1 1 0 1
lmo0506 7 21 2 10
And keep the other two rows with the same gene id in the column. The final parsed file would look like this: http://pastebin.com/WgDkm5ui
Here's what I have tried (this doesn't do anything useful as one piece, but the first part removes the second duplicate, and the second part sums the counts):
awk 'BEGIN {!a[$1]++} {for(i=1;i<=NF;i++) t+=$i; print t; t=0}'
I tried modifying the 2nd part of the script in the best answer of this question: Removing lines containing a unique first field with awk?
awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
But unfortunately I don't really understand what's going on well enough to get it working. Can anyone help me out? I think I need to replace the a[$1] > 1 part with something like "remove the first or second duplicate, depending on which count is larger".
EDIT: I'm also using GNU Awk 3.1.7 if that matters.
You can use this awk command:
awk 'NR == 1 {                # header row: print it and move on
    print;
    next
} {
    s = $2+$3+$4+$5           # total of the four count columns
} s >= sum[$1] {              # keep this row if its total is the largest seen for this id
    sum[$1] = s;
    if (!($1 in rows))        # remember first-seen order of the ids
        a[++n] = $1;
    rows[$1] = $0
} END {                       # print the surviving row for each id, in input order
    for(i=1; i<=n; i++)
        print rows[a[i]]
}' file | column -t
Output:
gene SRR034450.out.rpkm_0 SRR034451.out.rpkm_0 SRR034452.out.rpkm_0 SRR034453.out.rpkm_0
lmo0001 160 323 533 293
lmo0002 135 317 504 306
lmo0003 1 4 5 3
lmo0004 35 59 58 48
lmo0005 113 218 257 187
lmo0006 279 519 653 539
lmo0007 563 1053 1165 1069
lmo0008 34 84 203 107
lmo0009 13 45 90 49
lmo0010 57 210 237 169
lmo0011 65 224 247 179
lmo0012 65 226 250 215
lmo0013 342 500 738 682
lmo0014 662 1032 1283 1311
lmo0015 321 413 631 637
lmo0016 175 253 273 325
lmo0017 3 6 6 6
lmo0018 33 38 46 45
lmo0019 13 1 39 1
lmo0020 3 12 28 15
lmo0021 3 4 14 12
lmo0022 2 3 5 1
lmo0023 2 0 3 2
lmo0024 1 0 2 6
lmo0330 1 1 1 3
lmo0506 151 232 60 204
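For reference, the two-pass FNR==NR idiom the question tried to adapt can also be made to work. A sketch, reading the file twice: the first pass records the maximum sum per id, the second prints only the rows that attain it (on an exact tie both duplicates would be printed):
awk 'FNR == 1 { if (NR == 1) print; next }             # print the header once
     { s = $2 + $3 + $4 + $5 }                         # sum the four count columns
     FNR == NR { if (s > max[$1]) max[$1] = s; next }  # pass 1: record max per id
     s == max[$1]                                      # pass 2: print only max rows
    ' file file | column -t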

Compare two files having different column numbers and print the requirement to a new file if condition satisfies

I have two files with more than 10000 rows:
File1 has 1 column; File2 has 4 columns:
23 23 88 90 0
34 43 74 58 5
43 54 87 52 3
54 73 52 35 4
. .
. .
I want to compare each value in File1 with the first column of File2. If it exists there, print that value along with the other three values from File2. In this example the output will be:
23 88 90 0
43 74 58 5
54 87 52 3
.
.
I have written the following script, but it is taking too much time to execute.
s1=1; s2=$(wc -l < File1.txt)
while [ $s1 -le $s2 ]
do
    # pick the s1-th value from File1 (shell variables must be passed to awk with -v)
    n=$(awk -v row="$s1" 'NR == row {print $1}' File1.txt)
    p1=1; p2=$(wc -l < File2.txt)
    while [ $p1 -le $p2 ]
    do
        awk -v key="$n" -v row="$p1" 'NR == row && $1 == key {printf ("%s %s %s %s\n", $1, $2, $3, $4)}' File2.txt >> ofile.txt
        (( p1++ ))
    done
    (( s1++ ))
done
Is there any short/easy way to do it?
You can do it very concisely using awk:
awk 'FNR==NR{found[$1]++; next} $1 in found'
Test
>>> cat file1
23
34
43
54
>>> cat file2
23 88 90 0
43 74 58 5
54 87 52 3
73 52 35 4
>>> awk 'FNR==NR{found[$1]++; next} $1 in found' file1 file2
23 88 90 0
43 74 58 5
54 87 52 3
How it works:
FNR==NR checks whether FNR, the per-file record number, is equal to NR, the total record number. This is true only while the first file, file1, is being read, because FNR is reset to 1 when awk starts a new file while NR keeps counting.
{found[$1]++; next} If the check is true, record $1, the first column of file1, as a key in the associative array found, then skip to the next record.
$1 in found This check is reached only for the second file, file2. If the column 1 value $1 is an index in the associative array found, the entire line is printed (no action is written because printing is the default action).
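If both files happen to be in lexicographic sort order on the key column (the sample files above are), join(1) is another option; a minimal sketch, joining on the first field of each file:
join File1.txt File2.txt > ofile.txt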
