Recode value in the column in unix with awk or sed - linux

In the following file, the value in the 6th column should be replaced with -9 for every row whose 6th column is anything other than 1 or 2. How can I do it?
old.fam
18_0033 26210 0 0 1 1
18_0036 24595 0 0 1 2
18_0040 25563 0 0 1
18_0041 35990 0 0 0 -8
18_0042 39398 0 0 0 -8
18_0045 21586 0 0 1 1
18_0050 22211 0 0 1 2
new.fam should be
18_0033 26210 0 0 1 1
18_0036 24595 0 0 1 2
18_0040 25563 0 0 1 -9
18_0041 35990 0 0 0 -9
18_0042 39398 0 0 0 -9
18_0045 21586 0 0 1 1
18_0050 22211 0 0 1 2
Edit: I used cat old.fam | awk '{ if ($6==1 || $6==2) {print $1 " " $2 " " $3 " " $4 " " $5 " " $6 ;} else {print $1 " " $2 " " $3 " " $4 " " $5 " " -9;}}'> new.fam
Now the problem is that the rows with the replaced 6th-column value (-9) have no space separator between the 5th and 6th columns.
18_0033 26210 0 0 1 1
18_0036 24595 0 0 1 2
18_0040 25563 0 0 1-9
18_0041 35990 0 0 0-9
18_0042 39398 0 0 0-9
18_0045 21586 0 0 1 1
18_0050 22211 0 0 1 2

Here you have something you can start working on:
cat test.txt | awk '{if ($6==1||$6==2) {print $1 " " $6;} else {print $1 " -9";}}'
The awk script does the following:
check the value of the sixth column
between both checks, there's the awk || logical OR operator
The rest of the script is obvious.
Edit
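In the expression $5 " " -9, awk parses " " -9 as a subtraction (the string " " counts as 0), so the result is -9 with no space in front of it. Put the space and the -9 in one string instead: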
awk '{ if ( $6==1|| $6==2) {print $1 " " $2 " " $3 " " $4 " " $5 " " $6 ;} else
{print $1 " " $2 " " $3 " " $4 " " $5 " -9";}}'
(Mind the $5 " -9" at the end)
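As an alternative to printing every field by hand, the sixth field can be assigned in place. This is a minimal sketch rather than a drop-in solution: the function name recode_col6 is made up for illustration, and it assumes whitespace-separated input like old.fam.

```shell
# Hypothetical helper name; reads the files given as arguments, or stdin.
recode_col6() {
    awk '
        # Anything other than 1 or 2 in column 6 (including a missing
        # column) is overwritten. Assigning to $6 rebuilds that line
        # with single spaces between the fields.
        $6 != 1 && $6 != 2 { $6 = -9 }
        { print }
    ' "$@"
}

# Small demonstration on three rows taken from old.fam.
printf '%s\n' '18_0036 24595 0 0 1 2' \
              '18_0040 25563 0 0 1' \
              '18_0041 35990 0 0 0 -8' | recode_col6
```

With the real data this would be called as recode_col6 old.fam > new.fam. Because no string concatenation is involved, the missing-space problem cannot occur.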

Related

Executing Concatenation for all rows

I'm working with GWAS data.
Using p-link command I was able to get SNPslist, SNPs.map, SNPs.ped.
Here are the data files and commands I have for 2 SNPs (rs6923761, rs7903146):
$ cat SNPs.map
0 rs6923761 0 0
0 rs7903146 0 0
$ cat SNPs.ped
6 6 0 0 2 2 G G C C
74 74 0 0 2 2 A G T C
421 421 0 0 2 2 A G T C
350 350 0 0 2 2 G G T T
302 302 0 0 2 2 G G C C
bash commands I used:
echo -n IID > SNPs.csv
cat SNPs.map | awk '{printf ",%s", $2}' >> SNPs.csv
echo >> SNPs.csv
cat SNPs.ped | awk '{printf "%s,%s%s,%s%s\n", $1, $7, $8, $9, $10}' >> SNPs.csv
cat SNPs.csv
Output:
IID,rs6923761,rs7903146
6,GG,CC
74,AG,TC
421,AG,TC
350,GG,TT
302,GG,CC
With only 2 SNPs I could look up their column positions by hand and hard-code them in the command above. But now I have 2000 SNP IDs and their values, and I need a bash command that can process all 2000 SNPs in the same way.
One awk idea that replaces all of the current code:
awk '
BEGIN { printf "IID" }
# process 1st file:
FNR==NR { printf ",%s", $2; next }
# process 2nd file:
FNR==1 { print "" } # terminate 1st line of output
{ printf "%s", $1 # print 1st column (explicit format, so a % in the data is harmless)
for (i=7;i<=NF;i=i+2) # loop through columns 7-NF, incrementing index +2 on each pass
printf ",%s%s", $i, $(i+1) # print (i)th and (i+1)th columns
print "" # terminate line
}
' SNPs.map SNPs.ped
NOTE: remove comments to declutter code
This generates:
IID,rs6923761,rs7903146
6,GG,CC
74,AG,TC
421,AG,TC
350,GG,TT
302,GG,CC
You can use --recodeA flag in plink to have your IID as rows and SNPs as columns.

Insert a row and a column in a matrix using awk

I have a gridded dataset with 250 rows x 300 columns in matrix form:
ifile.txt
2 3 4 1 2 3
3 4 5 2 4 6
2 4 0 5 0 7
0 0 5 6 3 8
I would like to insert the latitude values at the first column and longitude values at the top. Which looks like:
ofile.txt
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
The increment is 0.33
I can do it manually for a small matrix, but I have no idea how to produce the output in the desired format for the full dataset. I tried writing a script along the following lines, but it is completely useless.
echo 20 > latitude.txt
for i in `seq 1 250`;do
i1=$(( i + 0.33 )) #bash can't recognize fractions
echo $i1 >> latitude.txt
done
echo 100 > longitude.txt
for j in `seq 1 300`;do
j1=$(( j + 0.33 ))
echo $j1 >> longitude.txt
done
paste longitude.txt ifile.txt > dummy_file.txt
cat latitude.txt dummy_file.txt > ofile.txt
$ cat tst.awk
BEGIN {
lat = 100
lon = 20
latWid = lonWid = 6
latDel = lonDel = 0.33
latFmt = lonFmt = "%*.2f"
}
NR==1 {
printf "%*s", latWid, ""
for (i=1; i<=NF; i++) {
printf lonFmt, lonWid, lon
lon += lonDel
}
print ""
}
{
printf latFmt, latWid, lat
lat += latDel
for (i=1; i<=NF; i++) {
printf "%*s", lonWid, $i
}
print ""
}
$ awk -f tst.awk file
20.00 20.33 20.66 20.99 21.32 21.65
100.00 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
The following awk may also help:
awk -v col=100 -v row=20 'FNR==1{printf OFS;for(i=1;i<=NF;i++){printf row OFS;row=row+.33;};print ""} {$1=$1;print col OFS $0;col+=.33}' OFS="\t" Input_file
The same solution in non-one-liner form:
awk -v col=100 -v row=20 '
FNR==1{
printf OFS;
for(i=1;i<=NF;i++){
printf row OFS;
row=row+.33;
};
print ""
}
{
$1=$1;
print col OFS $0;
col+=.33
}
' OFS="\t" Input_file
Awk solution:
awk 'NR == 1{
long = 20.00; lat = 100.00; printf "%12s%.2f", "", long;
for (i=1; i<NF; i++) { long += 0.33; printf "\t%.2f", long } print "" }
NR > 1{ lat += 0.33 }
{
printf "%.2f%6s", lat, "";
for (i=1; i<=NF; i++) printf "\t%d", $i; print ""
}' file
With perl
$ perl -lane 'print join "\t", "", map {20.00+$_*0.33} 0..$#F if $.==1;
print join "\t", 100+(0.33*$i++), @F' ip.txt
20 20.33 20.66 20.99 21.32 21.65
100 2 3 4 1 2 3
100.33 3 4 5 2 4 6
100.66 2 4 0 5 0 7
100.99 0 0 5 6 3 8
-a to auto-split input on whitespace, with the result saved in the @F array
See https://perldoc.perl.org/perlrun.html#Command-Switches for details on command line options
if $.==1 for the first line of input
map {20.00+$_*0.33} 0..$#F iterates based on the size of the @F array, and for each iteration we get a value from the expression inside {}, where $_ will be 0, 1, etc. up to the last index of the @F array
print join "\t", "", map... uses tab as the separator to print an empty element followed by the results of map
For all the lines, print the contents of the @F array prefixed with the result of 100+(0.33*$i++), where $i is initially 0 in numeric context. Again, tab is used as the separator while joining these values
Use sprintf if needed for formatting, also $, can be initialized instead of using join
perl -lane 'BEGIN{$,="\t"; $st=0.33}
print "", map { sprintf "%.2f", 20+$_*$st} 0..$#F if $.==1;
print sprintf("%.2f", 100+($st*$i++)), @F' ip.txt
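The awk answers above add 0.33 repeatedly, while the perl version computes each coordinate as start + index * step. Here is a minimal awk sketch of that second approach, which avoids accumulating floating-point error over 250 rows and 300 columns. The function name add_coords and the variable names lat0, lon0 and step are assumptions for illustration, and the output is tab-separated rather than width-aligned.

```shell
# Hypothetical helper name; reads the files given as arguments, or stdin.
add_coords() {
    awk -v lat0=100 -v lon0=20 -v step=0.33 '
    NR == 1 {
        # Header row: an empty corner cell, then one longitude per column.
        for (i = 1; i <= NF; i++)
            printf "\t%.2f", lon0 + (i - 1) * step
        print ""
    }
    {
        # Data row: the latitude for this row, then the original values.
        printf "%.2f", lat0 + (NR - 1) * step
        for (i = 1; i <= NF; i++)
            printf "\t%s", $i
        print ""
    }' "$@"
}

# Small demonstration on a 2 x 3 matrix.
printf '2 3 4\n3 4 5\n' | add_coords
```

For the real data this would be run as add_coords ifile.txt > ofile.txt.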

add header to columns from list text file awk

I have a very large text file with hundreds of columns. I want to add a header to every column from an independent text file containing a list.
My large file looks like this:
largefile.txt
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
my list of headers:
headers.txt
h1
h2
h3
wanted output:
output.txt
h1 h2 h3 h4 h5 h6 h7 etc..
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
$ awk 'NR==FNR{h=h OFS $0; next} FNR==1{print OFS OFS h} 1' head large | column -s ' ' -t
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
or if you prefer:
$ awk -v OFS='\t' 'NR==FNR{h=h OFS $0; next} FNR==1{print OFS OFS h} {$1=$1}1' head large
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
Well, here's one. OFS is tab for eye candy. From the OP I concluded that the headers should start from the fourth field, hence +3s in the code.
$ awk -v OFS="\t" ' # tab OFS
NR==FNR { a[NR]=$1; n=NR; next } # has headers
FNR==1 { # print headers in the beginning of 2nd file
$1=$1 # rebuild record for tabs
b=$0 # buffer record
$0="" # clear record
for(i=1;i<=n;i++) # spread head to fields
$(i+3)=a[i]
print $0 ORS b # output head and buffered first record
}
{ $1=$1 }1' head data # implicit print with record rebuild
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
Then again, this would also do the trick:
$ awk 'NR==FNR{h=h (NR==1?"":OFS) $0;next}FNR==1{print OFS OFS OFS h}1' head data
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc
Use paste to pivot the headers into a single line and then cat them together with the main file (- instead of a file name means stdin to cat):
$ paste -s -d' ' headers.txt | cat - largefile.txt
If you really need the headers to line up as in your example output you can preprocess (either manually or with a command) the headers file, or you can finish with sed (for just one option) as below:
$ paste -s -d' ' headers.txt | cat - largefile.txt | sed '1 s/^/ /'
h1 h2 h3
chrom start end 0 1 0 1 0 0 0 etc
chrom start end 0 0 0 0 1 1 1 etc
chrom start end 0 0 0 1 1 1 1 etc

How to find common rows in multiple files using awk

I have tab delimited text files in which common rows between them are to be found based on columns 1 and 2 as key columns.
Sample files:
file1.txt
aba 0 0
aba 0 0 1
abc 0 1
abd 1 1
xxx 0 0
file2.txt
xyz 0 0
aba 0 0 0 0
aba 0 0 0 1
xxx 0 0
abc 1 1
file3.txt
xyx 0 0
aba 0 0
aba 0 1 0
xxx 0 0 0 1
abc 1 1
The below code does the same and returns the rows only if the key column is found in all the N files (3 files in this case).
awk '
FNR == NR {
arr[$1,$2] = 1
line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
next
}
FNR == 1 { delete found }
{ if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
num_files = ARGC -1
for ( key in arr ) {
if ( arr[key] < num_files ) { continue }
split( line[ key ], line_arr, SUBSEP )
for ( i = 1; i <= length( line_arr ); i++ ) {
printf "%s\n", line_arr[ i ]
}
}
}
' *.txt > commoninall.txt
Output:
xxx 0 0
aba 0 0
aba 0 0 1
However, now I would like to get the output if 'x' files have the key columns.
For example x=2 i.e. rows which are common in two files based on key columns 1 and 2. The output in this case would be:
xyz 0 0
abc 1 1
In the real scenario I have to specify different values for x. Can anybody suggest an edit to this, or a new solution?
First attempt
I think you just need to modify the END block a little, and the command invocation:
awk -v num_files=${x:-0} '
…
…script as before…
…
END {
if (num_files == 0) num_files = ARGC - 1
for (key in arr) {
if (arr[key] == num_files) {
split(line[key], line_arr, SUBSEP)
for (i = 1; i <= length(line_arr); i++) {
printf "%s\n", line_arr[i]
}
}
}
}
'
Basically, this takes a command line parameter based on $x, defaulting to 0, and assigning it to the awk variable num_files. In the END block, the code checks for num_files being zero, and resets it to the number of files passed on the command line. (Interestingly, the value in ARGC discounts any -v var=value options and either a command line script or -f script.awk, so the ARGC-1 term remains correct. The array ARGV contains awk (or whatever name you invoked it with) in ARGV[0] and the files to be processed in ARGV[1] through ARGV[ARGC-1].) The loop then checks for the required number of matches and prints as before. You can change == to >= if you want the 'or more' option.
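The point about ARGC and ARGV can be checked with a small sketch. The function name show_args is made up for illustration. Everything runs in a BEGIN block, so the named files are never opened and need not exist.

```shell
# Hypothetical helper: print what awk sees as its operands.
show_args() {
    awk -v num_files=2 'BEGIN {
        # The -v assignment and the program text are not counted in ARGC;
        # only ARGV[0] (the command name) and the operands are.
        printf "ARGC=%d\n", ARGC
        for (i = 1; i < ARGC; i++)
            printf "ARGV[%d]=%s\n", i, ARGV[i]
    }' "$@"
}

show_args file1.txt file2.txt file3.txt
```

With three file operands this reports ARGC=4, so ARGC - 1 is the number of files, as the END block assumes.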
Does it work?
I observed in a comment:
I'm not clear what you are asking. I took it that your code was working for the example with three files and producing the right answer. I simply suggested how to modify the working code to handle N files and at least M of them sharing an entry. I have just realized, while typing this, that there is a bit more work to do. An entry could be missing from the first file but present in the others and will need to be processed, therefore. It is easy to report all occurrences in every file, or the first occurrence in any file. It is harder to report all occurrences only in the first file with a key.
The response was:
It is perfectly fine to report first occurrence in any file and need not be only from the first file. However, the issue with the suggested modification is, it is producing the same output for different values of x.
That's curious: I was able to get sane output from the amended code with different values for the number of files where the key must appear. I used this shell script. The code in the awk program up to the END block is the same as in the question; the only change is in the END processing block.
#!/bin/bash
while getopts n: opt
do
case "$opt" in
(n) num_files=$OPTARG;;
(*) echo "Usage: $(basename "$0" .sh) [-n number] file [...]" >&2
exit 1;;
esac
done
shift $(($OPTIND - 1))
awk -v num_files=${num_files:-$#} '
FNR == NR {
arr[$1,$2] = 1
line[$1,$2] = line[$1,$2] (line[$1,$2] ? SUBSEP : "") $0
next
}
FNR == 1 { delete found }
{ if (arr[$1,$2] && ! found[$1,$2]) { arr[$1,$2]++; found[$1,$2] = 1 } }
END {
if (num_files == 0) num_files = ARGC - 1
for (key in arr) {
if (arr[key] == num_files) {
split(line[key], line_arr, SUBSEP)
for (i = 1; i <= length(line_arr); i++) {
printf "%s\n", line_arr[i]
}
}
}
}
' "$@"
Sample runs (data files from question):
$ bash common.sh file?.txt
xxx 0 0
aba 0 0
aba 0 0 1
$ bash common.sh -n 3 file?.txt
xxx 0 0
aba 0 0
aba 0 0 1
$ bash common.sh -n 2 file?.txt
$ bash common.sh -n 1 file?.txt
abc 0 1
abd 1 1
$
That shows different answers depending on the value specified via -n. Note that this only shows lines that appear in the first file and appear in exactly N files in total. The only key that appears in two files (abc/1) does not appear in the first file, so it is not listed by this code which stops paying attention to new keys after the first file is processed.
Rewrite
However, here's a rewrite, using some of the same ideas, but working more thoroughly.
#!/bin/bash
# SO 30428099
# Given that the key for a line is the first two columns, this script
# lists all appearances in all files of a given key if that key appears
# in N different files (where N defaults to the number of files). For
# the benefit of debugging, it includes the file name and line number
# with each line.
usage()
{
echo "Usage: $(basename "$0" .sh) [-n number] file [...]" >&2
exit 1
}
while getopts n: opt
do
case "$opt" in
(n) num_files=$OPTARG;;
(*) usage;;
esac
done
shift $(($OPTIND - 1))
if [ "$#" = 0 ]
then usage
fi
# Record count of each key, regardless of file: keys
# Record count of each key in each file: key_file
# Count of different files containing each key: files
# Accumulate line number, filename, line for each key: lines
awk -v num_files=${num_files:-$#} '
{
keys[$1,$2]++;
if (++key_file[$1,$2,FILENAME] == 1)
files[$1,$2]++
#printf "%s:%d: Key (%s,%s); keys = %d; key_file = %d; files = %d\n",
# FILENAME, FNR, $1, $2, keys[$1,$2], key_file[$1,$2,FILENAME], files[$1,$2];
sep = lines[$1,$2] ? RS : ""
#printf "B: [[\n%s\n]]\n", lines[$1,$2]
lines[$1,$2] = lines[$1,$2] sep FILENAME OFS FNR OFS $0
#printf "A: [[\n%s\n]]\n", lines[$1,$2]
}
END {
#print "END"
for (key in files)
{
#print "Key =", key, "; files =", files[key]
if (files[key] == num_files)
{
#printf "TAG\n%s\nEND\n", lines[key]
print lines[key]
}
}
}
' "$@"
Sample output (given the data files from the question):
$ bash common.sh file?.txt
file1.txt 5 xxx 0 0
file2.txt 4 xxx 0 0
file3.txt 4 xxx 0 0 0 1
file1.txt 1 aba 0 0
file1.txt 2 aba 0 0 1
file2.txt 2 aba 0 0 0 0
file2.txt 3 aba 0 0 0 1
file3.txt 2 aba 0 0
file3.txt 3 aba 0 1 0
$ bash common.sh -n 2 file?.txt
file2.txt 5 abc 1 1
file3.txt 5 abc 1 1
$ bash common.sh -n 1 file?.txt
file1.txt 3 abc 0 1
file3.txt 1 xyx 0 0
file1.txt 4 abd 1 1
file2.txt 1 xyz 0 0
$ bash common.sh -n 3 file?.txt
file1.txt 5 xxx 0 0
file2.txt 4 xxx 0 0
file3.txt 4 xxx 0 0 0 1
file1.txt 1 aba 0 0
file1.txt 2 aba 0 0 1
file2.txt 2 aba 0 0 0 0
file2.txt 3 aba 0 0 0 1
file3.txt 2 aba 0 0
file3.txt 3 aba 0 1 0
$ bash common.sh -n 4 file?.txt
$
You can fettle this to give the output you want (probably missing file name and line number). If you only want the lines from the first file containing a given key, you only add the information to lines when files[$1,$2] == 1. You can separate the recorded information with SUBSEP instead of RS and OFS if you prefer.
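Can't you simply use uniq to search for repeated lines in your files? Note that uniq only detects adjacent duplicates, so sort the input first, and that it compares whole lines rather than just the key columns.
Something like:
sort file1.txt file2.txt file3.txt | uniq -d
For your complete scenario, you could use uniq -c to get the number of repetitions of each line, and filter the counts with grep.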
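Here is a minimal sketch of the uniq -c idea applied to the key columns only. It is an illustration under stated assumptions, not a tested replacement for the awk scripts above: the function name keys_in_n_files is made up, keys are assumed to contain no whitespace, and only the keys are printed, not the full rows.

```shell
# Hypothetical helper: keys_in_n_files N file [...]
# Prints each (column 1, column 2) key that occurs in exactly N of the files.
keys_in_n_files() {
    want=$1
    shift
    for f in "$@"; do
        # One copy of each key per file, so repeats within a file count once.
        awk '{ print $1, $2 }' "$f" | sort -u
    done |
        sort | uniq -c |                          # count the files per key
        awk -v n="$want" '$1 == n { print $2, $3 }'
}
```

Calling keys_in_n_files 2 file?.txt would list the keys present in exactly two files. Changing == to >= in the final awk gives the "or more" behaviour.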

How to select rows in which column two and three are not equal to each other and to 0 or 1?(with awk)

I have a file like this:
AX-75448119 0 1
AX-75448118 0.45 0.487179
AX-75474642 0 0
AX-75474643 0.25 0.820513
AX-75448113 1 0
AX-75474641 1 1
I want to select the rows in which columns 2 and 3 are not both equal to each other and equal to 0 or 1. (That is, if columns 2 and 3 are the same but equal to 0.5, or any other number except 0 and 1, I would still like to keep that row.)
so the output would be:
AX-75448119 0 1
AX-75448118 0.45 0.487179
AX-75474643 0.25 0.820513
AX-75448113 1 0
I know how to write the command to select the rows that column 2 and 3 are equal to each other and are equal to 0 or 1 which is this:
awk '$2=$3==1 || $2=$3==0' test.txt | wc -l
but I want exactly the opposite: to select every row that is not in the output of the above command.
Thanks, I hope I was able to explain what I want.
It might work for you if I get your requirements correctly.
awk ' $2 != $3 { print; next } $2 == $3 && $2 != 0 && $2 != 1 { print }' INPUTFILE
See it in action at Ideone.com
This might also work for you:
awk '($2==0 || $2==1) && ($3==0 || $3==1) && $2==$3{next}1' file
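Both answers amount to negating the condition "columns 2 and 3 are equal, and that shared value is 0 or 1". Here is a minimal sketch that states the negation directly. The function name drop_fixed_rows is made up for illustration.

```shell
# Hypothetical helper: keep every row except those where columns 2 and 3
# are equal to each other and that shared value is 0 or 1.
drop_fixed_rows() {
    awk '!($2 == $3 && ($2 == 0 || $2 == 1))' "$@"
}

# Demonstration on three rows from the question; only the first is dropped.
printf '%s\n' 'AX-75474642 0 0' \
              'AX-75448118 0.45 0.487179' \
              'AX-75448113 1 0' | drop_fixed_rows
```

On the real file this would be drop_fixed_rows test.txt | wc -l to count the remaining rows.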
