I have what I think should be a common problem, but I haven't found a good solution for it yet.
I have a file where each line has a chromosome number, a starting position in the chromosome and some related values, as shown below.
1 1.07299851019 1 1.07299851019 HQ chrY 2845223 + 0.251366120219 46
1 1.06860686763 1 1.06860686763 HQ chr10 88595309 + 0.256830601093 47
1 1.04688316093 3 3.14064948278 HQ chr6 49126474 + 0.295081967213 54
1 1.1563829915 1 1.1563829915 HQ chrX 16428176 + 0.185792349727 34
I want to sort this file with the Unix sort command on both chromosome (column 6) and starting position (column 7). After searching around I came up with this, which got me fairly close:
nohup sort -t $'\t' -k 6.4,6.5n -k 7,7n
The remaining problem that I can't solve is that, while the numbered chromosomes are sorted correctly, chromosome X and chromosome Y are sorted together on starting position, like this:
1 0.978579587641 9 8.80721628876 HQ chrX 2861057 - 0.431693989071 79
1 0.979500536702 1 0.979500536702 HQ chrY 2861314 - 0.420765027322 77
1 0.969979601694 9 8.72981641525 HQ chrX 2861649 - 0.469945355191 86
I know it would be possible to solve e.g. by replacing chrX and chrY with numbers, or write a program to solve it, but it would be super nice to be able to use a simple command, especially since the file sizes often are huge and I do this repeatedly.
It would also be nice if the chromosomes line up in order 1 to 22 and then X and then Y. My command had chromosome X and Y coming first and then chromosome 1 to 22.
To separate X from Y, you can specify a fallback key:
nohup sort -t $'\t' -k 6.4,6.5n -k 6,6 -k 7,7n
(This says that if two rows compare equal on the key 6.4,6.5 when compared numerically, the next step is to compare them on field 6 non-numerically, before trying field 7. The fallback key is written as 6,6 because a bare -k 6 would extend from field 6 to the end of the line.)
Disclaimer: this doesn't satisfy the goal in your last paragraph:
It would also be nice if the chromosomes line up in order 1 to 22 and then X and then Y. My command had chromosome X and Y coming first and then chromosome 1 to 22.
because X and Y will still be treated as zero during the numeric sort, and the fallback won't change that. Hopefully you find it useful anyway.
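To see the fallback key at work, here is a minimal sketch on a few made-up rows (the values are hypothetical; only fields 6 and 7 matter). It assumes a sort that treats non-numeric text as 0 under -n, as GNU sort does:

```shell
# Hypothetical sample rows: field 6 is the chromosome, field 7 the position.
tab=$(printf '\t')
rows=$(printf '1\t1\t1\t1\tHQ\t%s\t%s\n' \
    chrX 2861057  chrY 2861314  chrX 2861649  chr2 500)

# Key 1: characters 4-5 of field 6, numeric (X and Y both count as 0).
# Key 2: field 6 as plain text, which separates X from Y.
# Key 3: field 7, numeric.
sorted=$(printf '%s\n' "$rows" |
    LC_ALL=C sort -t "$tab" -k 6.4,6.5n -k 6,6 -k 7,7n |
    cut -f 6,7)

printf '%s\n' "$sorted"
```

The chrX rows now come out together, ordered by position, followed by chrY, but both still sort ahead of chr2.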
I know it would be possible to solve e.g. by replacing chrX and chrY with numbers, […]
Indeed, you can do that replacement on the fly:
sed 's/chrX/chr23/; s/chrY/chr24/' |
sort -t $'\t' -k 6.4,6.5n -k 7,7n |
sed 's/chr23/chrX/; s/chr24/chrY/'
(Note that the line breaks in this command are optional; I included them for readability, but you can put it all on one line when you actually use it.)
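If you are worried about the substitution touching text outside the chromosome column, one possible variant is to match chrX and chrY only when they fill a whole tab-delimited field. This is a sketch on made-up rows, and it assumes the data never contains a genuine chr23 or chr24:

```shell
# Hypothetical sample rows: field 6 is the chromosome, field 7 the position.
tab=$(printf '\t')
rows=$(printf '1\t1\t1\t1\tHQ\t%s\t%s\n' \
    chrY 2845223  chr10 88595309  chr6 49126474  chrX 16428176)

# Rewrite X/Y only when they occupy a whole field, sort, then restore them.
sorted=$(printf '%s\n' "$rows" |
    sed "s/${tab}chrX${tab}/${tab}chr23${tab}/; s/${tab}chrY${tab}/${tab}chr24${tab}/" |
    LC_ALL=C sort -t "$tab" -k 6.4,6.5n -k 7,7n |
    sed "s/${tab}chr23${tab}/${tab}chrX${tab}/; s/${tab}chr24${tab}/${tab}chrY${tab}/" |
    cut -f 6,7)

printf '%s\n' "$sorted"
```

The trailing cut is only there to keep the printed result short; drop it to keep the full rows.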
If your version of sort supports the -V option which is meant for sorting alphanumeric columns then you can do something like:
$ cat file
1 1.07299851019 1 1.07299851019 HQ chrY 2845223 + 0.251366120219 46
1 1.06860686763 1 1.06860686763 HQ chr10 88595309 + 0.256830601093 47
1 1.04688316093 3 3.14064948278 HQ chr6 49126474 + 0.295081967213 54
1 1.1563829915 1 1.1563829915 HQ chrX 16428176 + 0.185792349727 34
$ sort -t$'\t' -k6,6V -k7,7n file
1 1.04688316093 3 3.14064948278 HQ chr6 49126474 + 0.295081967213 54
1 1.06860686763 1 1.06860686763 HQ chr10 88595309 + 0.256830601093 47
1 1.1563829915 1 1.1563829915 HQ chrX 16428176 + 0.185792349727 34
1 1.07299851019 1 1.07299851019 HQ chrY 2845223 + 0.251366120219 46
Elaborating on jaypal's answer from before...
You can change the sort criteria per column like so:
sort -k1,1V input.txt
This sorts on column 1, and only column 1, using the aforementioned -V option, which is described in the following quote:
What -V means is “natural sort of (version) numbers within text” (type
man sort to find out), and it magically orders numbers and texts.
If you have multiple columns in a tab delimited file and you want to specify the primary column sort order you can do something like the following:
sort -k14,14V -k1,1n input.txt
The above uses column 14 as the first sort key with the -V sorting algorithm, then uses column 1 as the secondary sort key with numeric sorting. (This might be useful in some circles for sorting by chromosome and then position.)
To address the missing -V option for OSX users:
The Mac OS X native sort does not support -V, you’ll have
to install GNU core utilities and use gsort instead.
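If the same script has to run on both Linux and a Mac, one possible approach is to probe for -V support and fall back to gsort. This is only a sketch: the variable name sort_cmd is made up, and it assumes gsort is on the PATH when the native sort lacks -V:

```shell
# Pick a sort binary that understands -V; fall back to gsort (GNU coreutils).
if printf 'chr2\nchr10\n' | sort -V >/dev/null 2>&1; then
    sort_cmd=sort
elif command -v gsort >/dev/null 2>&1; then
    sort_cmd=gsort
else
    echo "no sort with -V support found" >&2
    exit 1
fi

sorted=$(printf '%s\n' chr21 chr2 chrY chr1 chr10 chrX | LC_ALL=C "$sort_cmd" -V)
printf '%s\n' "$sorted"
```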
For a quick look at how -V sorting will work you can see the below example...
Example input:
chr21
chr2
chr3
chrY
chr1
chr3
chr10
chrX
V sorted output:
chr1
chr2
chr3
chr3
chr10
chr21
chrX
chrY
Can I convert the first format below into the second one using just basic shell processing, awk, or sed on Linux?
This is a toy example:
This is the kind of text file I have. Columns 2 and 3 describe a range that is closed on the left and open on the right:
chr1 0 2 0
chr1 2 6 1.5
chr2 0 3 0
chr2 3 10 2.1
I want to convert it so that each position is described on its own line:
chr1 0 0
chr1 1 0
chr1 2 1.5
chr1 3 1.5
chr1 4 1.5
chr1 5 1.5
chr2 0 0
chr2 1 0
chr2 2 0
chr2 3 2.1
...
chr2 9 2.1
This can be done with awk:
awk '{for(i=$2;i<$3;i++)print $1,i,$4}' file
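The loop takes the start and end of the range from $2 and $3, respectively, and runs i from the start up to, but not including, the end.
For each position in the range it prints the chromosome ($1), the position, and the value ($4), as requested.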
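As a quick check, here is a sketch that runs the same loop on the two chr1 rows of the toy example. The tab-separated input and the OFS setting are my assumptions; the original prints with single spaces:

```shell
# Two rows from the toy example, written with tabs (an assumption).
rows=$(printf '%s\t%s\t%s\t%s\n' chr1 0 2 0  chr1 2 6 1.5)

# Expand each half-open range [start, end) into one line per position.
expanded=$(printf '%s\n' "$rows" |
    awk -v OFS='\t' '{ for (i = $2; i < $3; i++) print $1, i, $4 }')

printf '%s\n' "$expanded"
```

Position 2 appears once, with value 1.5, because the first range stops before its end coordinate.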
Another option is to use set and map operations with bedops, bedmap, and cut:
$ bedops --chop 1 foo.bed | bedmap --faster --echo --echo-map-id --delim "\t" - foo.bed | cut -f1,2,4 > answer.txt
Might offer some flexibility if other types of divisions and signal mapping are needed.
I have a big file like this example:
#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID
uc001aaa.3 chr1 + 11873 14409 11873 11873 3 11873,12612,13220, 12227,12721,14409, uc001aaa.3
uc010nxr.1 chr1 + 11873 14409 11873 11873 3 11873,12645,13220, 12227,12697,14409, uc010nxr.1
uc010nxq.1 chr1 + 11873 14409 12189 13639 3 11873,12594,13402, 12227,12721,14409, B7ZGX9 uc010nxq.1
uc009vis.3 chr1 - 14361 16765 14361 14361 4 14361,14969,15795,16606, 14829,15038,15942,16765, uc009vis.3
I want to change the 4th column: in each row, the value in column 4 should be replaced by the value in column 5 of the same row minus 1, i.e. "(element of column 5) - 1".
I am not very familiar with the Linux command line (shell). Do you know how I can do that in a single line?
Here is the expected output:
#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID
uc001aaa.3 chr1 + 14408 14409 11873 11873 3 11873,12612,13220, 12227,12721,14409, uc001aaa.3
uc010nxr.1 chr1 + 14408 14409 11873 11873 3 11873,12645,13220, 12227,12697,14409, uc010nxr.1
uc010nxq.1 chr1 + 14408 14409 12189 13639 3 11873,12594,13402, 12227,12721,14409, B7ZGX9 uc010nxq.1
uc009vis.3 chr1 - 16764 16765 14361 14361 4 14361,14969,15795,16606, 14829,15038,15942,16765, uc009vis.3
awk is a great tool for manipulating files like this. It allows processing a file that consists of records of fields; by default records are defined by lines in the file and fields are separated by spaces. The awk command line to do what you want is:
awk '!/^#/ { $4 = $5 - 1 } { print }' <filename>
An awk program is a sequence of pattern-action pairs. If a pattern is omitted, the action is performed for all input records; if an action is omitted (not used in this program), the default action is to print the record. Fields are referenced in an awk program as $n, where n is the field number. There are several forms of pattern, but the one used here is the negation of a regular expression that is matched against the whole record. So this program updates the 4th field to be the value of the 5th field minus 1, but only for lines that do not start with a #, to avoid messing up the header. Then, for all records (because the pattern is omitted), the record is printed. The pattern-action pairs are evaluated in order, so the record is printed after the 4th field has been updated.
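One thing to be aware of: assigning to a field makes awk rebuild the record using the output field separator, which is a single space by default. If your file is tab-delimited (an assumption on my part, suggested by the empty proteinID values), a sketch like this keeps the tabs intact. The sample below is cut down to the first five columns:

```shell
# Header plus one row, reduced to five tab-separated columns for brevity.
input=$(printf '#name\tchrom\tstrand\ttxStart\ttxEnd\nuc001aaa.3\tchr1\t+\t11873\t14409\n')

# FS = OFS = "\t": split on tabs and rebuild modified records with tabs.
output=$(printf '%s\n' "$input" |
    awk 'BEGIN { FS = OFS = "\t" } !/^#/ { $4 = $5 - 1 } { print }')

printf '%s\n' "$output"
```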
Save your content in a file named a, then run:
awk '{if(NR>1){$4=$5-1;print $0}else{print $0}}' a
Apologies if this is very obvious; I'm a bash beginner! I would like to remove lines from a file if the 1st column equals the 3rd and the 2nd equals the 4th. The structure of this file (file_A) is:
chr1 25000 chr2 16475000 1
chr1 25000 chr2 114325000 1
chr1 25000 chr2 224825000 1
chr1 25000 chr3 196825000 1
To see if I could get lines that satisfy the condition, I ran:
awk '$1==$3 && $2==$4' file_A > test_equal
test_equal contains 54640 lines.
But then when I tried to do everything in one line, i.e.
awk '$1!=$3 && $2!=$4' file_A > test_unequal
test_unequal contains 12767257 lines, while file_A originally had 24384120 lines. Shouldn't there be 24384120-54640 = 24329480 lines in test_unequal?
I'm sure this is a silly mistake on my part, but I'm not sure where I've gone wrong. Thanks very much for your help!
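A likely explanation is De Morgan's law: the complement of "A and B" is "not A or not B", so the second command drops every line where either pair matches, not only the lines where both do. Here is a small sketch on made-up rows (not taken from file_A) that counts the lines kept by each form:

```shell
# Hypothetical rows: both pairs equal, only col1==col3, only col2==col4, neither.
rows=$(printf '%s %s %s %s 1\n' \
    chr1 25000 chr1 25000 \
    chr1 25000 chr1 30000 \
    chr1 25000 chr2 25000 \
    chr1 25000 chr2 16475000)

count() { printf '%s\n' "$rows" | awk "$1"' { n++ } END { print n + 0 }'; }

both_equal=$(count '$1==$3 && $2==$4')      # lines to remove
and_form=$(count '$1!=$3 && $2!=$4')        # too strict: both pairs must differ
negated=$(count '!($1==$3 && $2==$4)')      # true complement
or_form=$(count '$1!=$3 || $2!=$4')         # same thing, by De Morgan

echo "$both_equal $and_form $negated $or_form"
```

So either awk '!($1==$3 && $2==$4)' file_A or awk '$1!=$3 || $2!=$4' file_A should give the line count you expected.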
I have over 100 files with at least 5-8 columns (tab-separated) in each file. I need to extract the first three columns from each file, add a fourth column with some predefined text, and append the results to one file.
Let's say I have 3 files: file001.txt, file002.txt, file003.txt.
file001.txt:
chr1 1 2 15
chr2 3 4 17
file002.txt:
chr1 1 2 15
chr2 3 4 17
file003.txt:
chr1 1 2 15
chr2 3 4 17
combined_file.txt:
chr1 1 2 f1
chr2 3 4 f1
chr1 1 2 f2
chr2 3 4 f2
chr1 1 2 f3
chr2 3 4 f3
For simplicity I kept the file contents the same.
My script is as follows:
#!/bin/bash
for i in {1..3}; do
j=$(printf '%03d' $i)
awk 'BEGIN { OFS="\t"}; {print $1,$2,$3}' file${j}.txt | awk -v k="$j" 'BEGIN {print $0"\t$k”}' | cat >> combined_file.txt
done
But the script is giving the following errors:
awk: non-terminated string $k”}... at source line 1
context is
<<<
awk: giving up
source line number 2
awk: non-terminated string $k”}... at source line 1
context is
<<<
awk: giving up
source line number 2
Can someone help me figure it out?
You don't need two different awk scripts. And you don't use $ to refer to variables in awk; that's used to refer to input fields (i.e. $k means access the field whose number is in the variable k).
for i in {1..3}; do
j=$(printf '%03d' $i)
awk -v k="$j" -v OFS='\t' '{print $1, $2, $3, k}' file$j.txt
done > combined_file.txt
As pointed out in the comments, your problem is that you're trying to use odd characters (typographic quotes) as if they were double quotes. Once you fix that, though, you don't need a loop or any of that other complexity. All you need is:
$ awk 'BEGIN{FS=OFS="\t"} {print $1, $2, $3, "f" ARGIND}' file*
chr1 1 2 f1
chr2 3 4 f1
chr1 1 2 f2
chr2 3 4 f2
chr1 1 2 f3
chr2 3 4 f3
The above uses GNU awk for ARGIND.
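If GNU awk is not available, a possible workaround is to count files yourself: FNR resets to 1 at the start of each input file. This is a sketch with made-up five-column files in a temporary directory; note that, unlike ARGIND, an empty file would not be counted:

```shell
# Create three hypothetical tab-separated input files with five columns each.
tmpdir=$(mktemp -d)
for id in 001 002 003; do
    printf 'chr1\t1\t2\t15\t0.5\nchr2\t3\t4\t17\t0.7\n' > "$tmpdir/file$id.txt"
done

# FNR == 1 marks the first line of each file, so n is a running file counter.
combined=$(awk 'BEGIN { FS = OFS = "\t" } FNR == 1 { n++ } { print $1, $2, $3, "f" n }' \
    "$tmpdir"/file*.txt)

printf '%s\n' "$combined"
rm -r "$tmpdir"
```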
I have a TAB delimited table like this (the first line is header):
symbol value chr start end
Arrb1 10 chr1 1000 2000
Arrb1 20 chr1 1000 2000
Arrb1 30 chr1 1000 2000
Myc 5 chr2 3000 4000
Actin 3 chr4 25000 30000
Actin 5 chr4 25000 30000
.
.
.
I want to deduplicate the table by the first column (symbol), and if there are multiple lines for the same symbol, keep the line with the biggest value (column 2). So the result should look like:
symbol value chr start end
Arrb1 30 chr1 1000 2000
Myc 5 chr2 3000 4000
Actin 5 chr4 25000 30000
.
.
.
Can I do it using AWK? Thanks!
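awk -F'\t' 'NR == 1 { print; next }         # pass the header through
!($1 in b) || $2 + 0 > b[$1] {              # first sighting, or a larger value
    if (!($1 in b)) order[++n] = $1         # remember first-seen order
    a[$1] = $0; b[$1] = $2 + 0
}
END { for (i = 1; i <= n; i++) print a[order[i]] }' file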
If there is no header, here is a shorter one:
sort -k1,1 -k2,2nr file |awk '!a[$1]++'
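The sort puts the largest value first within each symbol, and awk '!a[$1]++' prints only the first line seen per symbol. If the file does have a header, one way to adapt this is to split the header off before sorting. A sketch using the sample table from the question, assuming tab-separated input (the output comes out ordered by symbol, not in the original order):

```shell
# Sample table from the question, tab-separated.
tab=$(printf '\t')
table=$(printf '%s\t%s\t%s\t%s\t%s\n' \
    symbol value chr start end \
    Arrb1 10 chr1 1000 2000 \
    Arrb1 30 chr1 1000 2000 \
    Myc 5 chr2 3000 4000 \
    Actin 3 chr4 25000 30000 \
    Actin 5 chr4 25000 30000)

deduped=$(
    printf '%s\n' "$table" | head -n 1                 # header goes first
    printf '%s\n' "$table" | tail -n +2 |              # data rows only
        LC_ALL=C sort -t "$tab" -k1,1 -k2,2nr |        # biggest value first per symbol
        awk -F'\t' '!seen[$1]++'                       # keep first line per symbol
)

printf '%s\n' "$deduped"
```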