Sorting and Uniq - linux

I have a file that I'm trying to sort on the fourth column, removing duplicate lines based on that column as well. My file looks like this after I ran sort -uk4,4:
chr1 76190472 76190502 NM_000016_cds_0_0_chr1_76190473_f 0 +
chr1 76226806 76227055 NM_000016_cds_10_0_chr1_76226807_f 0 +
chr1 76228376 76228448 NM_000016_cds_11_0_chr1_76228377_f 0 +
chr1 76194085 76194173 NM_000016_cds_1_0_chr1_76194086_f 0 +
chr1 76198328 76198426 NM_000016_cds_2_0_chr1_76198329_f 0 +
chr1 76198537 76198607 NM_000016_cds_3_0_chr1_76198538_f 0 +
chr1 76199212 76199313 NM_000016_cds_4_0_chr1_76199213_f 0 +
chr1 76200475 76200556 NM_000016_cds_5_0_chr1_76200476_f 0 +
chr1 76205664 76205795 NM_000016_cds_6_0_chr1_76205665_f 0 +
chr1 76211490 76211599 NM_000016_cds_7_0_chr1_76211491_f 0 +
chr1 76215103 76215244 NM_000016_cds_8_0_chr1_76215104_f 0 +
chr1 76216135 76216231 NM_000016_cds_9_0_chr1_76216136_f 0 +
However, the command has not sorted as I would prefer: after the _cds_ in the fourth column I would like the numbers in ascending order (0, 1, 2, 3, etc.) instead of 0, 10, 11, 1. Is there any way to do such a thing?

Your requirements aren't completely clear to me, but it is likely that you want this:
sort -k4n file
-n sorts using numerical order.

You can extract just that number into a new (integer) variable and then sort on that. I think the problem is that right now the number is just part of a string.
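If your sort supports version sorting (-V, a GNU extension; macOS users need gsort from GNU coreutils), it compares the embedded numbers naturally, so cds_1 sorts before cds_10. A sketch, using a hypothetical file exons.txt holding a subset of the data above:

```shell
# Build a small sample out of order, then version-sort on column 4,
# dropping duplicate keys with -u.
cat > exons.txt <<'EOF'
chr1 76226806 76227055 NM_000016_cds_10_0_chr1_76226807_f 0 +
chr1 76190472 76190502 NM_000016_cds_0_0_chr1_76190473_f 0 +
chr1 76194085 76194173 NM_000016_cds_1_0_chr1_76194086_f 0 +
EOF
sort -u -k4,4V exons.txt
```

This prints the cds_0 line first, then cds_1, then cds_10, which is the order the question asks for.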

Related

Expand each line of text file according to their corresponding numbers on linux

Can I convert the first format to the second one just with basic shell processing, awk, or sed on Linux?
This is a toy example:
This is the kind of text file I have: col2 and col3 define a range, left-closed and right-open, and col4 is the value.
chr1 0 2 0
chr1 2 6 1.5
chr2 0 3 0
chr2 3 10 2.1
I want it converted to describe each position, like this:
chr1 0 0
chr1 1 0
chr1 2 1.5
chr1 3 1.5
chr1 4 1.5
chr1 5 1.5
chr2 0 0
chr2 1 0
chr2 2 0
chr2 3 2.1
...
chr2 9 2.1
This can be done with awk:
awk '{for(i=$2;i<$3;i++)print $1,i,$4}' file
The start and end of the range are taken from $2 and $3, respectively, and each position in the range is printed on its own line, as requested.
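To sanity-check the one-liner, you can feed it the toy input from a here-document instead of a file:

```shell
# Expand each left-closed, right-open range [$2, $3) into one line per position,
# carrying the value from $4 along.
awk '{for(i=$2;i<$3;i++)print $1,i,$4}' <<'EOF'
chr1 0 2 0
chr1 2 6 1.5
EOF
```

This emits one line per position, from chr1 0 0 up through chr1 5 1.5, matching the expected output above.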
Another option is to use set and map operations with bedops, bedmap, and cut:
$ bedops --chop 1 foo.bed | bedmap --faster --echo --echo-map-id --delim "\t" - foo.bed | cut -f1,2,4 > answer.txt
This might offer some flexibility if other types of divisions and signal mapping are needed.

change in the text file in linux command line

I have a big file like this example:
#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID
uc001aaa.3 chr1 + 11873 14409 11873 11873 3 11873,12612,13220, 12227,12721,14409, uc001aaa.3
uc010nxr.1 chr1 + 11873 14409 11873 11873 3 11873,12645,13220, 12227,12697,14409, uc010nxr.1
uc010nxq.1 chr1 + 11873 14409 12189 13639 3 11873,12594,13402, 12227,12721,14409, B7ZGX9 uc010nxq.1
uc009vis.3 chr1 - 14361 16765 14361 14361 4 14361,14969,15795,16606, 14829,15038,15942,16765, uc009vis.3
I want to change the 4th column: each element in column 4 should be replaced by the element from column 5 in the same row, minus 1, i.e. the new value would be "(element of column 5) - 1".
I am not so familiar with the Linux command line (shell). Do you know how I can do that in a single line?
here is the expected output:
#name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds proteinID alignID
uc001aaa.3 chr1 + 14408 14409 11873 11873 3 11873,12612,13220, 12227,12721,14409, uc001aaa.3
uc010nxr.1 chr1 + 14408 14409 11873 11873 3 11873,12645,13220, 12227,12697,14409, uc010nxr.1
uc010nxq.1 chr1 + 14408 14409 12189 13639 3 11873,12594,13402, 12227,12721,14409, B7ZGX9 uc010nxq.1
uc009vis.3 chr1 - 16764 16765 14361 14361 4 14361,14969,15795,16606, 14829,15038,15942,16765, uc009vis.3
awk is a great tool for manipulating files like this. It allows processing a file that consists of records of fields; by default records are defined by lines in the file and fields are separated by spaces. The awk command line to do what you want is:
awk '!/^#/ { $4 = $5 - 1 } { print }' <filename>
An awk program is a sequence of pattern-action pairs. If a pattern is omitted, the action is performed for all input records; if an action is omitted (not used in this program), the default action is to print the record. Fields are referenced in an awk program as $n, where n is the field number. There are several forms of pattern, but the one used here is the negation of a regular expression that is matched against the whole record. So this program updates the 4th field to be the value of the 5th field minus 1, but only for lines that do not start with a #, to avoid messing up the header. Then, for all records (because the pattern is omitted), the record is printed. The pattern-action pairs are evaluated in order, so the record is printed after the 4th field has been updated.
Save your content in a file named a:
awk '{if(NR>1){$4=$5-1;print $0}else{print $0}}' a
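One caveat worth knowing with either answer: when awk assigns to a field, it rebuilds the record joined by OFS (a single space by default), so tab-separated input comes out space-separated unless you set OFS="\t". A quick check of the pattern on a two-line sample:

```shell
# Replace field 4 with field 5 minus 1, leaving the # header line untouched.
awk '!/^#/ { $4 = $5 - 1 } { print }' <<'EOF'
#name chrom strand txStart txEnd
uc001aaa.3 chr1 + 11873 14409
EOF
```

The header passes through unchanged, and the data line comes out with txStart rewritten to 14408 (i.e. 14409 - 1).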

awk- remove lines if two columns equal

Apologies if this is very obvious- I'm a bash beginner! I would like to remove lines from a file if the 1st column equals the 3rd, and if the 2nd equals the 4th. The structure of this file (file_A) is:
chr1 25000 chr2 16475000 1
chr1 25000 chr2 114325000 1
chr1 25000 chr2 224825000 1
chr1 25000 chr3 196825000 1
To see if I could get lines that satisfy the condition, I ran:
awk '$1==$3 && $2==$4' file_A > test_equal
test_equal contains 54640 lines.
But then when I tried to do everything in one line, i.e.
awk '$1!=$3 && $2!=$4' file_A > test_unequal
test_unequal contains 12767257 lines, while file_A originally had 24384120 lines. Shouldn't there be 24384120-54640 = 24329480 lines in test_unequal?
I'm sure this is a silly mistake on my part, but I'm not sure where I've gone wrong. Thanks very much for your help!
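The likely culprit is De Morgan's law: the complement of ($1==$3 && $2==$4) is ($1!=$3 || $2!=$4), with OR rather than AND. The && version silently drops every line where exactly one of the two pairs matches, which accounts for the missing lines. A sketch of the corrected filter on hypothetical sample rows:

```shell
# Keep a line unless BOTH pairs match; note the || (De Morgan's law).
# Line 1 is dropped (both pairs equal); lines 2 and 3 are kept, even
# though line 2 would wrongly be dropped by the && version.
awk '$1!=$3 || $2!=$4' <<'EOF'
chr1 25000 chr1 25000 1
chr1 25000 chr1 30000 1
chr1 25000 chr2 25000 1
EOF
```

Equivalently, awk '!($1==$3 && $2==$4)' makes the negation of the original condition explicit.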

Process multiple files and append them in linux/unix

I have over 100 files with at least 5-8 columns (tab-separated) in each file. I need to extract the first three columns from each file, add a fourth column with some predefined text, and append them all together.
Let's say I have 3 files: file001.txt, file002.txt, file003.txt.
file001.txt:
chr1 1 2 15
chr2 3 4 17
file002.txt:
chr1 1 2 15
chr2 3 4 17
file003.txt:
chr1 1 2 15
chr2 3 4 17
combined_file.txt:
chr1 1 2 f1
chr2 3 4 f1
chr1 1 2 f2
chr2 3 4 f2
chr1 1 2 f3
chr2 3 4 f3
For simplicity I kept the file contents the same.
My script is as follows:
#!/bin/bash
for i in {1..3}; do
j=$(printf '%03d' $i)
awk 'BEGIN { OFS="\t"}; {print $1,$2,$3}' file${j}.txt | awk -v k="$j" 'BEGIN {print $0"\t$k”}' | cat >> combined_file.txt
done
But the script is giving the following errors:
awk: non-terminated string $k”}... at source line 1
context is
<<<
awk: giving up
source line number 2
awk: non-terminated string $k”}... at source line 1
context is
<<<
awk: giving up
source line number 2
Can someone help me figure it out?
You don't need two different awk scripts. And you don't use $ to refer to variables in awk; that's used to refer to input fields (i.e. $k means "access the field whose number is in the variable k").
for i in {1..3}; do
j=$(printf '%03d' $i)
awk -v k="$j" -v OFS='\t' '{print $1, $2, $3, k}' file$j.txt
done > combined_file.txt
As pointed out in the comments, your problem is that you're trying to use typographic quotes (”) as if they were double quotes. Once you fix that, though, you don't need a loop or any of that other complexity; all you need is:
$ awk 'BEGIN{FS=OFS="\t"} {$NF="f"ARGIND} 1' file*
chr1 1 2 f1
chr2 3 4 f1
chr1 1 2 f2
chr2 3 4 f2
chr1 1 2 f3
chr2 3 4 f3
The above uses GNU awk for ARGIND.
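If GNU awk isn't available, ARGIND can be emulated portably: FNR resets to 1 at the start of each input file, so bumping a counter on FNR==1 tracks the file number. A sketch, creating two small tab-separated sample files first:

```shell
# Create two sample tab-separated files matching the question's layout.
printf 'chr1\t1\t2\t15\nchr2\t3\t4\t17\n' > file001.txt
printf 'chr1\t1\t2\t15\nchr2\t3\t4\t17\n' > file002.txt

# FNR==1 fires on the first line of each file, so c counts files;
# the last field is then replaced with f<file-number>.
awk 'BEGIN{FS=OFS="\t"} FNR==1{c++} {$NF="f"c} 1' file001.txt file002.txt
```

This prints f1 on the lines from the first file and f2 on the lines from the second, with any POSIX awk.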

Unix sort string and number together

I have what I think should be a common problem, but I didn't find any good solution for it yet.
I have a file where each line has a chromosome number, a starting position in the chromosome and some related values, like below.
1 1.07299851019 1 1.07299851019 HQ chrY 2845223 + 0.251366120219 46
1 1.06860686763 1 1.06860686763 HQ chr10 88595309 + 0.256830601093 47
1 1.04688316093 3 3.14064948278 HQ chr6 49126474 + 0.295081967213 54
1 1.1563829915 1 1.1563829915 HQ chrX 16428176 + 0.185792349727 34
I want to sort this file using unix sort command both on chromosome (column 6) and starting position (column 7). After searching around I came up with this, which got me fairly close:
nohup sort -t $'\t' -k 6.4,6.5n -k 7,7n
The remaining problem that I can't solve is that while the numerically named chromosomes are sorted all right, chromosome X and chromosome Y are sorted together on starting position only, like this:
1 0.978579587641 9 8.80721628876 HQ chrX 2861057 - 0.431693989071 79
1 0.979500536702 1 0.979500536702 HQ chrY 2861314 - 0.420765027322 77
1 0.969979601694 9 8.72981641525 HQ chrX 2861649 - 0.469945355191 86
I know it would be possible to solve this, e.g. by replacing chrX and chrY with numbers, or by writing a program, but it would be super nice to be able to use a simple command, especially since the files are often huge and I do this repeatedly.
It would also be nice if the chromosomes line up in order 1 to 22 and then X and then Y. My command had chromosome X and Y coming first and then chromosome 1 to 22.
To separate X from Y, you can specify a fallback key:
nohup sort -t $'\t' -k 6.4,6.5n -k 6 -k 7,7n
(this says that if two rows are equivalent in the field 6.4,6.5 as compared numerically, then the next step is to compare them in the field 6 non-numerically, before trying field 7).
Disclaimer: this doesn't satisfy the goal in your last paragraph:
It would also be nice if the chromosomes line up in order 1 to 22 and then X and then Y. My command had chromosome X and Y coming first and then chromosome 1 to 22.
because X and Y will still be treated as zero during the numeric sort, and the fallback won't change that. Hopefully you find it useful anyway.
I know it would be possible to solve e.g. by replacing chrX and chrY with numbers, […]
Indeed, you can do that replacement on the fly:
sed 's/chrX/chr23/; s/chrY/chr24/' |
sort -t $'\t' -k 6.4,6.5n -k 7,7n |
sed 's/chr23/chrX/; s/chr24/chrY/'
(Note that the line-breaks in this command are optional; I included them for readability, but you can put this on one line, if you want, if/when you actually use it.)
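The rename-sort-rename trick can be sanity-checked on a bare column of chromosome names (simplified here to a single field, so the sort key is -k1.4n, i.e. the digits after "chr", rather than the question's -k6.4,6.5n):

```shell
# Map X/Y to 23/24, numeric-sort on the digits after "chr", then map back.
printf 'chrY\nchr2\nchrX\nchr10\nchr1\n' |
sed 's/chrX/chr23/; s/chrY/chr24/' |
sort -k1.4n |
sed 's/chr23/chrX/; s/chr24/chrY/'
```

This yields chr1, chr2, chr10, chrX, chrY, which is exactly the 1-to-22-then-X-then-Y order the question asks for.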
If your version of sort supports the -V option which is meant for sorting alphanumeric columns then you can do something like:
$ cat file
1 1.07299851019 1 1.07299851019 HQ chrY 2845223 + 0.251366120219 46
1 1.06860686763 1 1.06860686763 HQ chr10 88595309 + 0.256830601093 47
1 1.04688316093 3 3.14064948278 HQ chr6 49126474 + 0.295081967213 54
1 1.1563829915 1 1.1563829915 HQ chrX 16428176 + 0.185792349727 34
$ sort -t$'\t' -k6V -k7n file
1 1.04688316093 3 3.14064948278 HQ chr6 49126474 + 0.295081967213 54
1 1.06860686763 1 1.06860686763 HQ chr10 88595309 + 0.256830601093 47
1 1.1563829915 1 1.1563829915 HQ chrX 16428176 + 0.185792349727 34
1 1.07299851019 1 1.07299851019 HQ chrY 2845223 + 0.251366120219 46
Elaborating on jaypal's answer from before...
You can change the sort criteria per column like so:
sort -k1,1V input.txt
This will sort on column 1, and only column 1, using the aforementioned -V option, described as follows:
What -V means is “natural sort of (version) numbers within text” (type
man sort to find out), and it magically orders numbers and texts.
If you have multiple columns in a tab delimited file and you want to specify the primary column sort order you can do something like the following:
sort -k14,14V -k1,1n input.txt
The above will use column 14 as the primary sort index and apply the -V sorting algorithm, then use column 1 as the secondary sort index with numeric sorting. (This might be useful in some circles for sorting by chromosome and then position.)
To address the missing -V option for OSX users:
The Mac OS X native sort does not support -V, you’ll have
to install GNU core utilities and use gsort instead.
For a quick look at how -V sorting will work you can see the below example...
Example input:
chr21
chr2
chr3
chrY
chr1
chr3
chr10
chrX
-V sorted output:
chr1
chr2
chr3
chr3
chr10
chr21
chrX
chrY
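The example above can be reproduced directly with a pipe (again assuming GNU sort, or gsort on macOS):

```shell
# Version sort orders the numeric parts as numbers, so chr2 < chr10 < chr21,
# with the lettered chromosomes X and Y sorting after the numbered ones.
printf 'chr21\nchr2\nchr3\nchrY\nchr1\nchr3\nchr10\nchrX\n' | sort -V
```

The output matches the listing above: chr1, chr2, chr3, chr3, chr10, chr21, chrX, chrY.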
