How to list or cat files sorted numerically?

I have the following files:
1A34_3_2.pdb
1A34_6_3.pdb
1A34_9_4.pdb
1A34_A_5.pdb
1A34_D_6.pdb
...
1A34_A2_23.pdb
...
1A34_BK_45.pdb
I would like to sort them numerically, based on the number at the end of the file name (e.g., "6" is the number at the end of 1A34_D_6.pdb).
If I use ls -1v 1A34*, the result is:
1A34_3_2.pdb
1A34_6_3.pdb
1A34_9_4.pdb
1A34_A2_23.pdb
1A34_A_5.pdb
1A34_BK_45.pdb
1A34_D_6.pdb
This is not in ascending order of the last number in the file name.

Feed the output of the ls command to this filter:
ls | perl -e '@filenames=<>; print sort {substr($a, 7) <=> substr($b, 7)} @filenames'
Output:
1A34_3_2.pdb
1A34_6_3.pdb
1A34_9_4.pdb
1A34_A_5.pdb
1A34_D_6.pdb
1A34_9_14.pdb
1A34_D_16.pdb
@filenames=<>; slurps the output of the ls command into the array @filenames.
sort {substr($a, 7) <=> substr($b, 7)} @filenames sorts @filenames numerically, starting at offset 7 of each string (in 1A34_3_2.pdb, offset 7 is where "2.pdb" begins).
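Note that a fixed offset breaks on names like 1A34_A2_23.pdb, where the number starts one character later. If the names always end in _<number>.pdb, keying on the last underscore-separated field may be more robust; a minimal sketch, assuming GNU sort:
ls -1 1A34_*.pdb | sort -t_ -k3,3n
Here -t_ splits each name on underscores and -k3,3n sorts numerically on the third field ("23.pdb" compares as 23, because the numeric parse stops at the first non-numeric character).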
References:
sort()
substr()


Extracting a set of characters for a column in a txt file

I have a BED file (a text file with tab-separated columns). The fourth column has a name followed by numbers. Using the command line (Linux), I would like to get these names without repetition. I have provided an example below.
This is my file:
$ head example.bed
1 2160195 2161184 SKI_1.2160205.2161174
1 2234406 2234552 SKI_1.2234416.2234542
1 2234713 2234849 SKI_1.2234723.2234839
1 2235268 2235551 SKI_1.2235278.2235541
1 2235721 2236034 SKI_1.2235731.2236024
1 2237448 2237699 SKI_1.2237458.2237689
1 2238005 2238214 SKI_1.2238015.2238204
1 9770503 9770664 PIK3CD_1.9770513.9770654
1 9775588 9775837 PIK3CD_1.9775598.9775827
1 9775896 9776146 PIK3CD_1.9775906.9776136
...
My list should look like this:
SKI_1
PIK3CD_1
...
Could you please help me with the code I need to use?
I found the solution years ago with grep, but I have lost the document in which I kept all my useful commands.
Given so.txt:
1 2160195 2161184 SKI_1.2160205.2161174
1 2234406 2234552 SKI_1.2234416.2234542
1 2234713 2234849 SKI_1.2234723.2234839
1 2235268 2235551 SKI_1.2235278.2235541
1 2235721 2236034 SKI_1.2235731.2236024
1 2237448 2237699 SKI_1.2237458.2237689
1 2238005 2238214 SKI_1.2238015.2238204
1 9770503 9770664 PIK3CD_1.9770513.9770654
1 9775588 9775837 PIK3CD_1.9775598.9775827
1 9775896 9776146 PIK3CD_1.9775906.9776136
Then the following command should do the trick:
cat so.txt | awk '{split($4,f,".");print f[1];}' | sort -u
$4 is the 4th column
We split the 4th column on the . character. The result is put into the f array
Finally we filter out the duplicates with sort -u
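The deduplication can also happen inside awk itself, avoiding the extra sort process (at the cost of the names appearing in file order rather than sorted); a minimal sketch, with seen as an arbitrary array name:
awk '{ split($4, f, "."); if (!seen[f[1]]++) print f[1] }' so.txt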
With data in the file bed and using awk:
awk 'NR>1 { split($4,arr,".");bed[arr[1]]="" } END { for (i in bed) { print i } }' bed
Ignore the first line, then split the 4th whitespace-delimited field into the array arr on ".", and use the first element of arr as an index into another array, bed. In the END block, loop through the bed array and print its indexes. (NR>1 skips the first record; drop it if, as in the sample above, the file has no header line.)
Using grep:
cat example.bed|awk '{print $4}'|grep -oP '\w+_\d+'|sort -u
produces:
PIK3CD_1
SKI_1
and without the cat command (as there are always UUOC advocates here...):
awk '{print $4}' < example.bed|grep -oP '\w+_\d+'|sort -u
I have found this:
head WRGL2_hg19_v1.bed | cut -f4 | cut -d "." -f1
Output:
PIK3CD_1
SKI_1
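Note that this variant prints the names with repetitions (and head only reads the first 10 lines); to process the whole file and deduplicate, the same pipeline can end in sort -u:
cut -f4 WRGL2_hg19_v1.bed | cut -d "." -f1 | sort -u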

How can I sort the directory contents by owner and group rights in the terminal?

Owner rights have priority over group rights during sorting.
I thought of something like ls -la | sort -n.
The first letter, though, which shows the type of the file, gets in the way and is counted as well.
How can I start sorting based on the 2nd column, where the owner rights start (not the 2nd field, but the 2nd terminal column)?
If this is not possible, is there any other solution for my problem?
Pipe the output of ls through cut?
ls -lA | cut -c 2- | sort
cut keeps everything from the 2nd character onward (dropping the file-type letter), and the output is then piped to sort.
You can use this awk (works only in gawk) program to do it:
NR==2,true {
    arr[substr($0, 2)] = $0
}
END {
    n = asorti(arr, sorted)
    for (i = 1; i <= n; i++)
        print arr[sorted[i]]
}
Here is how to run it: ls -l | awk -f prog.awk
What it does:
NR==2,true {
takes every line from the second one to the last (to skip the "total" header line)
arr[substr($0, 2)] = $0
save the whole line in the associative array arr, under an index that is the same as the line minus its first letter
END {
after reading all lines
n = asorti(arr, sorted)
sort the array by its indices (a gawk extension) and store the result in sorted, which is indexed by position; asorti also returns the number of elements, n. The values of sorted are the original indices (the lines without the first letter)
for (i = 1; i <= n; i++)
print arr[sorted[i]]
iterate the sorted indices in order and retrieve the original lines from the initial array arr (a plain for (i in sorted) loop is not guaranteed to visit the indices in order, hence the counted loop)
The advantage of this method is that it saves all the information about the entries (including whether it's a link or directory or something else).
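Alternatively, GNU sort can start the sort key at a character offset within a field, which ignores the type letter for comparison purposes while printing the lines unchanged; a minimal sketch, assuming GNU sort:
ls -l | tail -n +2 | sort -k1.2,1.10
tail -n +2 drops the "total" line, and -k1.2,1.10 keys on characters 2 through 10 of the first field, i.e. the nine permission characters, owner bits first.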

Scalable way of deleting all lines from a file where the line starts with one of many values

Given an input file of variable values (example):
A
B
D
What is a script to remove all lines from another file which start with one of the above values? For example, the file contents:
A
B
C
D
Would end up being:
C
The input file is of the order of 100,000 variable values. The file to be mangled is of the order of several million lines.
awk '
NR==FNR { # IF this is the first file in the arg list THEN
list[$0] # store the contents of the current record as an index or array "list"
next # skip the rest of the script and so move on to the next input record
} # ENDIF
{ # This MUST be the second file in the arg list
for (i in list) # FOR each index "i" in array "list" DO
if (index($0,i) == 1) # IF "i" starts at the 1st char on the current record THEN
next # move on to the next input record
}
1 # Specify a true condition and so invoke the default action of printing the current record.
' file1 file2
An alternative to building up an array and then doing a string comparison on each element would be to build up a regular expression, e.g.:
...
list = list "|" $0
...
and then doing an RE comparison:
...
if ($0 ~ list)
next
...
but I'm not sure that'd be any faster than the loop and you'd then have to worry about RE metacharacters appearing in file1.
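For reference, a complete version of that RE-based script might look like this (a sketch assuming, per the caveat above, that file1 contains no RE metacharacters):
awk '
NR==FNR { list = (list == "" ? $0 : list "|" $0); next }
$0 !~ ("^(" list ")")
' file1 file2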
If all of your values in file1 are truly single characters, though, then this approach of creating a character list to use in an RE comparison might work well for you:
awk 'NR==FNR{list = list $0; next} $0 !~ "^[" list "]"' file1 file2
You can also achieve this using egrep:
egrep -vf <(sed 's/^/^/' file1) file2
Let's see it in action:
$ cat file1
A
B
$ cat file2
Asomething
B1324
C23sd
D2356A
Atext
CtestA
EtestB
Bsomething
$ egrep -vf <(sed 's/^/^/' file1) file2
C23sd
D2356A
CtestA
EtestB
This would remove lines that start with one of the values in file1.
You can use comm to display the lines that are not common to both files, like this:
comm -3 file1 file2
Will print:
C
Notice that for this to work, both files have to be sorted; if they aren't, you can handle that using
comm -3 <(sort file1) <(sort file2)
Also note that comm compares whole lines, so this fits the simple example above, where file2's lines equal the values exactly, but not the general case of lines that merely start with them.

Bash CSV sorting and unique-ing

A Linux question: I have a CSV file, data.csv, with the following fields and values:
KEY,LEVEL,DATA
2.456,2,aaa
2.456,1,zzz
0.867,2,bbb
9.775,4,ddd
0.867,1,ccc
2.456,0,ttt
...
The field KEY is a float value, while LEVEL is an integer. I know that the first field can have repeated values, as can the second, but taken together they form a unique pair.
What I would like to do is sort the file by the column KEY and then, for each unique value of KEY, keep only the row with the highest value under LEVEL.
Sorting is not a problem:
$> sort -t, -k1,2 data.csv # fields: KEY,LEVEL,DATA
0.867,1,ccc
0.867,2,bbb
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd
...
but then how can I filter the rows so that I get what I want, which is:
0.867,2,bbb
2.456,2,aaa
9.775,4,ddd
...
Is there a way to do it using command-line tools like sort, uniq, awk and so on? Thanks in advance.
try this line:
your sort...|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'
output:
kent$ echo "0.867,1,bbb
0.867,2,ccc
2.456,0,ttt
2.456,1,zzz
2.456,2,aaa
9.775,4,ddd"|awk -F, 'k&&k!=$1{print p}{p=$0;k=$1}END{print p}'
0.867,2,ccc
2.456,2,aaa
9.775,4,ddd
The idea is: because your file is already sorted, just go through the input from the top; whenever the first column (KEY) changes, print the previous line, which holds the highest value of LEVEL for the previous KEY.
Try it with your real data; it should work.
Also, the whole logic (including your sort) could be done by awk in a single process.
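For example, a single-process version might look like this (a sketch assuming gawk, for PROCINFO["sorted_in"], and the header line shown in the question):
gawk -F, '
NR > 1 && (!($1 in best) || $2+0 > lvl[$1]) {   # keep the highest-LEVEL row per KEY
    lvl[$1] = $2
    best[$1] = $0
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"      # print KEYs in ascending numeric order
    for (k in best)
        print best[k]
}' data.csv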
Use:
$> sort -r data.csv | uniq -w 5 | sort
given your floats are formatted "0.000"-"9.999"
Perl solution:
perl -aF, -ne '$h{$F[0]} = [@F[1,2]] if $F[1] > $h{$F[0]}[0]
}{
print join ",", $_, @{$h{$_}} for sort {$a<=>$b} keys %h' data.csv
Note that the result is different from the one you requested: the first line contains bbb, not ccc. (The }{ sequence closes the implicit -n loop early, so the final print runs once after all input is read, like an END block.)

get 1 line with the same field in a file using shell script

I have a file, with contents like:
onelab2.warsaw.rd.tp.pl 5
onelab3.warsaw.rd.tp.pl 5
lefthand.eecs.harvard.edu 7
righthand.eecs.harvard.edu 7
planetlab2.netlab.uky.edu 8
planet1.scs.cs.nyu.edu 9
planetx.scs.cs.nyu.edu 9
So for each line there is a number. I want the 1st line for each number, so for the content above I want to get:
onelab2.warsaw.rd.tp.pl 5
lefthand.eecs.harvard.edu 7
planetlab2.netlab.uky.edu 8
planet1.scs.cs.nyu.edu 9
How can I achieve this? I'm hoping for a shell script using awk, sed, etc.
This might work for you (GNU sort):
sort -nsuk2 file
Sort numerically (-n) on the second field (-k2), keeping the original order of equal lines (-s, stable) and removing duplicates (-u).
Use the awk command for that:
awk '{if(!a[$2]){a[$2]=1; print}}' file.dat
Explanation:
{
# 'a' is a lookup table (array) which will contain all numbers
# that have been printed so far. It will be initialized as an empty
# array on its first usage by awk. So you don't have to care about.
# $2 is the second 'column' in the line -> the number
if(!a[$2])
{
# set index in the lookup table. This way the if statement will
# fail for the next line with the same number at the end
a[$2]=1;
# print the whole current line
print
}
}
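The same logic is often written as the classic awk one-liner, where the post-increment makes the condition true only on the first occurrence of each $2:
awk '!a[$2]++' file.dat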
With sort and uniq:
sort -n -k2 input | uniq -f1
perl -ane 'print unless $a{$F[1]}++' file
