Random selection of columns using Linux commands

I have a flat file (.txt) with 606,347 columns and I want to extract 50,000 RANDOM columns, with the exception of the first column, which is the sample identification. How can I do that using Linux commands?
My file looks like:
ID SNP1 SNP2 SNP3
1 0 0 2
2 1 0 2
3 2 0 1
4 1 1 2
5 2 1 0
It is TAB delimited.
Thank you so much.
Cheers,
Paula.

awk to the rescue!
$ cat shuffle.awk
function shuffle(a,n,k) {
    for(i=1;i<=k;i++) {
        j=int(rand()*(n-i))+i
        if(j in a) a[i]=a[j]
        else a[i]=j
        a[j]=i
    }
}
BEGIN {srand()}
NR==1 {shuffle(ar,NF,ncols)}
{for(i=1;i<=ncols;i++) printf "%s", $(ar[i]) FS; print ""}
General usage:
$ echo $(seq 5) | awk -f shuffle.awk -v ncols=5
3 4 1 5 2
In your special case you can keep $1 as the first output column and start the function loop from 2, i.e. change
for(i=1;i<=k;i++) to a[1]=1; for(i=2;i<=k;i++)
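A minimal sketch of that modified script, assuming the file is tab-delimited, the file names (file.txt, subset.txt) are placeholders, and the number of columns to keep is passed as ncols:
awk -v ncols=50000 '
function shuffle(a,n,k) {
    a[1]=1                              # pin the ID column to output position 1
    for(i=2;i<=k;i++) {
        j=int(rand()*(n-i))+i
        if(j in a) a[i]=a[j]
        else a[i]=j
        a[j]=i
    }
}
BEGIN {FS="\t"; srand()}
NR==1 {shuffle(ar,NF,ncols)}
{for(i=1;i<=ncols;i++) printf "%s", $(ar[i]) FS; print ""}
' file.txt > subset.txt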

Try this:
echo {2..606347} | tr ' ' '\n' | shuf | head -n 50000 | xargs -d '\n' | tr ' ' ',' | xargs -I {} cut -d $'\t' -f {} file
Update:
echo {2..606347} | tr ' ' '\n' | shuf | head -n 50000 | sed 's/.*/&p/' | sed -nf - <(tr '\t' '\n' <file) | tr '\n' '\t'

@karakfa's answer is great, but the NF value can't be obtained in the BEGIN{} part of the awk script. Refer to: How to get number of fields in AWK prior to processing
I edited the code as follows:
head -4 10X.txt | awk '
function shuffle(a,n,k) {
    for(i=1;i<=k;i++) {
        j=int(rand()*(n-i))+i
        if(j in a) a[i]=a[j]
        else a[i]=j
        a[j]=i
    }
}
BEGIN {
    FS=" "; OFS="\t"; ncols=10
}
NR==1 {
    shuffle(tmp_array,NF,ncols)
    for(i=1;i<=ncols;i++) {
        printf "%s", $(tmp_array[i]) OFS
    }
    print ""
}
NR>1 {
    printf "%s", $1 OFS
    for(i=1;i<=ncols;i++) {
        printf "%s", $(tmp_array[i]+1) OFS
    }
    print ""
}'
Because I am processing single-cell gene expression profiles, from the second row onward the first column holds gene names.
My output is:
D4-2_3095 D6-1_3010 D16-2i_1172 D4-1_337 iPSCs-2i_227 D4-2_170 D12-serum_1742 D4-1_1747 D10-2-2i_1373 D4-1_320
Sox17 0 0 0 0 0 0 0 0 0 0
Mrpl15 0.987862442831866 1.29176904082314 2.12650693025845 0 1.33257747910871 0 1.58815046312948 1.18541326956528 1.12103842107813 0.656789854017254
Lypla1 0 1.29176904082314 0 0 0.443505832809852 0.780385141793088 0.57601629238987 0 0 0.656789854017254

Related

Bash script to isolate words in a file

Here is my initial input data to be extracted:
david ex1=10 ex2=12 quiz1=5 quiz2=9 exam=99
judith ex1=8 ex2=16 quiz1=4 quiz2=10 exam=90
sam ex1=8 quiz1=5 quiz2=11 exam=85
song ex1=8 ex2=20 quiz2=11 exam=87
How do I extract each word so that it is formatted in this way:
david
ex1=10
ex2=12
etc...
As I eventually want to have output like this:
david 12 99
judith 16 90
sam 0 85
song 20 87
when I run my program with the command:
./marks ex2 exam < file
Supposing your input file is named input.txt, just replace each space character with a newline using the tr command-line tool:
tr ' ' '\n' < input.txt
For your second request, you may have to extract specific fields on each line, so the cut and awk commands may be useful (note that my example is certainly improvable):
while read p; do
    echo -n "$(echo $p | cut -d ' ' -f1) " # name
    echo -n "$(echo $p | cut -d ' ' -f3 | cut -d '=' -f2) " # ex2 val
    echo -n $(echo $p | awk -F"exam=" '{ print $2 }') # exam val
    echo
done < input.txt
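As an alternative sketch, a single awk pass can do the same lookups without running cut for every line; the field names ex2 and exam are hard-coded here, and a missing field prints 0:
awk '{
    ex2 = 0; exam = 0                  # defaults when a field is missing
    for (i = 2; i <= NF; i++) {
        split($i, kv, "=")             # split each "key=value" pair
        if (kv[1] == "ex2")  ex2  = kv[2]
        if (kv[1] == "exam") exam = kv[2]
    }
    print $1, ex2, exam
}' input.txt
With the sample input this should print the desired table, including "sam 0 85".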
This script does what you want:
#!/bin/bash
a="$*"
awk -v a="$a" -F'[[:space:]=]+' '
BEGIN {
    split(a, b)                    # split field names into array b
}
{
    printf "%s ", $1               # print first field
    for (i in b) {                 # loop through fields to search for
        f = 0                      # unset "found" flag
        for (j=2; j<=NF; j+=2)     # loop through remaining fields, 2 at a time
            if ($j == b[i]) {      # if field matches value in array
                printf "%s ", $(j+1)
                f = 1              # set "found" flag
            }
        if (!f) printf "0 "        # add 0 if field not found
    }
    print ""                       # add newline
}' file
Testing it out
$ ./script.sh ex2 exam
david 12 99
judith 16 90
sam 0 85
song 20 87

Subtracting N columns from two files with AWK

I have two files, each with N columns.
File1:
A 1 2 3 ....... Na1
B 2 3 4 ....... Nb1
File2:
A 2 2 4 ....... Na2
B 1 3 4 ....... Nb2
I want an output where each numeric column of File2 is subtracted from the corresponding column of File1, all the way to column N, as shown below:
A -1 0 -1 ........ (Na1-Na2)
B 1 0 0 ........ (Nb1-Nb2)
How can I do this with AWK or Perl scripting in a Linux environment?
This has already been answered, but I will add a one-liner. It uses paste to concatenate the files and awk to subtract:
paste file{1,2} | awk '{for (i=1;i<=NF/2;i++) printf "%s ", ($i==$i+0)?$i-$(i+NF/2):$i; print ""}'
Validation:
$ cat file1
A 1 2 3 4 5
B 2 3 4 5 6
$ cat file2
A 2 2 4 10 12
B 1 3 4 3 5
$ paste file{1,2} | awk '{for (i=1;i<=NF/2;i++) printf "%s ", ($i==$i+0)?$i-$(i+NF/2):$i; print ""}'
A -1 0 -1 -6 -7
B 1 0 0 2 1
It requires both files to have the same number of columns, and non-numeric columns must be at the same positions. For a non-numeric column it prints the value from the first file; otherwise it prints the difference.
Try:
awk '{split($0,S); getline<f; for(i=2; i<=NF; i++) $i-=S[i]}1' OFS='\t' f=file1 file2
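With File1 and File2 holding just the four concrete columns shown above (ignoring the "......." placeholders), this should print tab-separated differences along these lines:
A -1 0 -1
B 1 0 0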
Here's one way using GNU awk. Run like:
awk -f script.awk File2 File1 | rev | column -t | rev
Contents of script.awk:
FNR==NR {
    for(i=2;i<=NF;i++) {
        a[$1][i]=$i
    }
    next
}
{
    for(j=2;j<=NF;j++) {
        $j-=a[$1][j]
    }
}1
Alternatively, here's the one-liner:
awk 'FNR==NR { for(i=2;i<=NF;i++) a[$1][i]=$i; next } { for(j=2;j<=NF;j++) $j-=a[$1][j] }1' File2 File1 | rev | column -t | rev
Results:
A -1 0 -1
B 1 0 0
awk 'FNR==NR{for(i=2;i<=NF;i++)a[FNR"-"i]=$i;next}{printf "\n"$1" ";for(i=2;i<=NF;i++){printf $i-a[FNR"-"i]" "}}' file1 file2
> cat file1
A 1 2 3
B 2 3 4
> cat file2
A 2 2 4
B 1 3 4
> awk 'FNR==NR{for(i=2;i<=NF;i++)a[FNR"-"i]=$i;next}{printf "\n"$1" ";for(i=2;i<=NF;i++){printf $i-a[FNR"-"i]" "}}' file1 file2
A 1 0 1
B -1 0 0
>
Alternatively, put this in a file:
#!/usr/bin/awk -f
FNR==NR {
    for(i=2;i<=NF;i++)
        a[FNR"-"i]=$i
    next
}
{
    printf "\n"$1" "
    for(i=2;i<=NF;i++) {
        printf $i-a[FNR"-"i]" "
    }
}
and execute as:
awk -f file.awk file1 file2
Something like this:
use strict;
use warnings;
my (@fh, @v);
for (@ARGV) {
    open (my $handle, "<", $_) or die ("$!: $_");
    push @fh, $handle;
}
while (@v = map { [split ' ', <$_> ] } @fh and defined shift @{$v[0]}) {
    print join(" ", (shift @{$v[1]}, map { $_ - shift(@{$v[1]}) } @{$v[0]})), "\n";
}
close $_ for (@fh);
To run:
perl script.pl input1 input2
Something like this perhaps? I'm afraid I can't test this code as I have no PC to hand at present.
This program expects the names of the two files as parameters on the command line, and outputs the results to STDOUT.
use strict;
use warnings;
use autodie;
my @fh;
for my $filename (@ARGV) {
    open my $fh, '<', $filename;
    push @fh, $fh;
}
until (grep eof $_, @fh) {
    my @records;
    for my $fh (@fh) {
        my $line = <$fh>;
        chomp $line;
        push @records, [ split ' ', $line ];
    }
    $records[0][$_] -= $records[1][$_] for 1 .. $#{$records[0]};
    print "@{$records[0]}\n";
}
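If it is saved as, say, subtract.pl (the name is just an example), it would be run the same way as the previous script:
perl subtract.pl File1 File2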

bash join multiple files with empty replacement (-e option)

I have the following code to join multiple files together. It works fine, but I want to replace the empty values with 0, so I used -e "0". However, it doesn't work.
Any ideas?
for k in `ls file?`
do
    if [ -a final.results ]
    then
        join -a1 -a2 -e "0" final.results $k > tmp.res
        mv tmp.res final.results
    else
        cp $k final.results
    fi
done
example:
file1:
a 1
b 2
file2:
a 1
c 2
file3:
b 1
d 2
Results:
a 1 0 1 0
b 2 1 0
c 2
d 2
expected:
a 1 1 0
b 2 0 1
c 0 2 0
d 0 0 2
As an aside, the GNU version of join supports -o auto. The -e and -o options cause enough frustration to turn people to learning awk (see also: How to get all fields in outer join with Unix join?). As cmh said, it's poorly documented, but when using join the -e option only works in conjunction with the -o option.
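For the two-file case, a quick sketch of what -o auto buys you with file1 and file2 from the example (GNU join only):
join -a1 -a2 -e 0 -o auto file1 file2
This should print something like:
a 1 1
b 2 0
c 0 2
It doesn't remove the need to grow the column list when looping over many files, though.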
General solution:
cut -d ' ' -f1 file? | sort -u > tmp.index
for k in file?; do join -a1 -e '0' -o '2.2' tmp.index $k > tmp.file.$k; done
paste -d " " tmp.index tmp.file.* > final.results
rm tmp*
Bonus: how do I compare multiple branches in git?
for k in pmt atc rush; do git ls-tree -r $k | cut -c13- > ~/tmp-branch-$k; done
cut -f2 ~/tmp-branch-* | sort -u > ~/tmp-allfiles
for k in pmt atc rush; do join -a1 -e '0' -t$'\t' -11 -22 -o '2.2' ~/tmp-allfiles ~/tmp-branch-$k > ~/tmp-sha-$k; done
paste -d " " ~/tmp-allfiles ~/tmp-sha-* > final.results
egrep -v '(.{40}).\1.\1' final.results # these files are not the same everywhere
It's poorly documented, but when using join the -e option only works in conjunction with the -o option. The order string needs to be amended each time around the loop. The following code should generate your desired output.
i=3
orderl='0,1.2'
orderr=',2.2'
for k in $(ls file?)
do
    if [ -a final.results ]
    then
        join -a1 -a2 -e "0" -o "$orderl$orderr" final.results $k > tmp.res
        orderl="$orderl,1.$i"
        i=$((i+1))
        mv tmp.res final.results
    else
        cp $k final.results
    fi
done
As you can see, it starts to become messy. If you need to extend this much further it might be worth deferring to a beefier tool such as awk or python.
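A minimal awk sketch of that idea, assuming whitespace-separated key/value files and that a missing key simply means 0 (the final sort is only there because awk's in-array order is unspecified):
awk '
    FNR == 1 { nf++ }                        # new input file: bump the file counter
    {
        keys[$1]                             # remember every key ever seen
        val[$1, nf] = $2                     # value of this key in file number nf
    }
    END {
        for (k in keys) {
            line = k
            for (i = 1; i <= nf; i++)
                line = line " " (((k, i) in val) ? val[k, i] : 0)
            print line
        }
    }
' file1 file2 file3 | sort > final.results
With the three sample files this should give the expected a/b/c/d table.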
Assuming there are no duplicate keys in a single file and the keys do not contain whitespace, you could use gawk and a sorted glob of files. This approach is quite quick for large files and uses relatively little memory compared to slurping all of the data at once. Run it like:
gawk -f script.awk $(ls -v file*)
Contents of script.awk:
BEGINFILE {
    c++                                # count which file we are on (gawk extension)
}
z[$1]                                  # referencing z[$1] records the key as seen in this file (prints nothing)
$1 in a {                              # key already known from an earlier file:
    a[$1] = a[$1] FS ($2 ? $2 : "0")   # append this file's value, or 0 if empty
    next
}
{                                      # key seen for the first time:
    for (i=1; i<=c; i++) {
        r = (r ? r FS : "") (i == c ? ($2 ? $2 : "0") : "0")   # pad earlier files with 0
    }
    a[$1] = r; r = ""
    b[++n] = $1                        # remember first-seen order
}
ENDFILE {
    for (j in a) {
        if (!(j in z)) {
            a[j] = a[j] FS "0"         # known keys missing from this file get a 0
        }
    }
    delete z                           # reset the per-file key set
}
END {
    for (k=1; k<=n; k++) {
        print b[k], a[b[k]]            # print keys in first-seen order
    }
}
Test input / Results of grep . file*:
file1:a 1
file1:x
file1:b 2
file2:a 1
file2:c 2
file2:g
file3:b 1
file3:d 2
file5:m 6
file5:a 4
file6:x
file6:m 7
file7:x 9
file7:c 8
Results:
a 1 1 0 4 0 0
x 0 0 0 0 0 9
b 2 0 1 0 0 0
c 0 2 0 0 0 8
g 0 0 0 0 0 0
d 0 0 2 0 0 0
m 0 0 0 6 7 0
I gave up on using join and wrote my script another way:
keywords=`cat file? | awk '{print $1}' | sort | uniq | xargs`
files=`ls file? | xargs`
for p in $keywords
do
    x=`echo $p`
    for k in $files
    do
        if grep -q ^$p $k
        then
            y=`cat $k | grep ^$p | awk '{print $2}'`
            x=`echo $x $y`
        else
            echo $p $k
            x=`echo $x 0`
        fi
    done
    echo $x >> final.results
done

sorting a "key/value pair" array in bash

How do I sort a "python dictionary-style" array, e.g. ( "A: 2" "B: 3" "C: 1" ), in bash by the value? I think this code snippet will make my question a bit clearer.
State="Total 4 0 1 1 2 0 0"
W=$(echo $State | awk '{print $3}')
C=$(echo $State | awk '{print $4}')
U=$(echo $State | awk '{print $5}')
M=$(echo $State | awk '{print $6}')
WCUM=( "Owner: $W;" "Claimed: $C;" "Unclaimed: $U;" "Matched: $M" )
echo ${WCUM[@]}
This will simply print the array: Owner: 0; Claimed: 1; Unclaimed: 1; Matched: 2
How do I sort the array (or the output), eliminating any pair with a "0" value, so that the result looks like this:
Matched: 2; Claimed: 1; Unclaimed: 1
Thanks in advance for any help or suggestions. Cheers!!
A quick and dirty idea would be (this just sorts the output, not the array):
echo ${WCUM[@]} | sed -e 's/; /;\n/g' | awk -F: '!/ 0;?/ {print $0}' | sort -t: -k 2 -r | xargs
echo -e ${WCUM[@]} | tr ';' '\n' | sort -r -k2 | egrep -v ": 0$"
Sorting and filtering are independent steps, so if you only want to filter out the 0 values, it is much easier.
Append an
| tr '\n' ';'
to get it to a single line again in the end.
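Putting those pieces together, a rough combined one-liner (each remaining entry keeps a leading space and the line ends with a semicolon):
echo ${WCUM[@]} | tr ';' '\n' | egrep -v ": 0$" | sort -t: -k2 -nr | tr '\n' ';'
which should give something like: Matched: 2; Claimed: 1; Unclaimed: 1;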
nonull=$(for n in ${!WCUM[@]}; do echo ${WCUM[n]} | egrep -v ": 0;"; done | tr -d "\n")
I don't see a good reason to end $W, $C and $U with a semicolon but not $M, so instead of adapting my code to this distinction I would eliminate the special case. If that's not possible, I would append a semicolon to $M temporarily and remove it at the end.
Another attempt, using some bash features, but it still needs sort, which is crucial:
#! /bin/bash
State="Total 4 1 0 4 2 0 0"
string=$State
for i in 1 2 ; do # remove unnecessary fields
    string=${string#* }
    string=${string% *}
done
# Insert labels
string=Owner:${string/ /;Claimed:}
string=${string/ /;Unclaimed:}
string=${string/ /;Matched:}
# Remove zeros
string=(${string[@]//;/; })
string=(${string[@]/*:0;/})
string=${string[@]}
# Format
string=${string//;/$'\n'}
string=${string//:/: }
# Sort
string=$(sort -t: -nk2 <<< "$string")
string=${string//$'\n'/;}
echo "$string"

unix - breakdown of how many lines with number of character occurrences

Is there an inbuilt command to do this or has anyone had any luck with a script that does it?
I am looking to get counts of how many lines had how many occurrences of a specific character, sorted descending by the number of occurrences.
For example, with this sample file:
gkdjpgfdpgdp
fdkj
pgdppp
ppp
gfjkl
Suggested invocation (for the 'p' character):
bash/perl some_script_name "p" samplefile
Desired output:
occs count
4 1
3 2
0 2
Update:
How would you write a solution that works on a 2-character string such as 'gd', not just a single character such as 'p'?
$ sed 's/[^p]//g' input.txt | awk '{print length}' | sort -nr | uniq -c | awk 'BEGIN{print "occs", "count"}{print $2,$1}' | column -t
occs count
4 1
3 2
0 2
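For the two-character case in the update, the [^p] character class doesn't carry over directly; one sketch is to let awk's gsub() count the matches instead, feeding the same sort | uniq -c | awk tail as above:
awk -v pat='gd' '{ print gsub(pat, "") }' samplefile | sort -nr | uniq -c | awk 'BEGIN{print "occs", "count"}{print $2,$1}' | column -t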
You could give the desired character as the field separator for awk, and do this:
awk -F 'p' '{ print NF-1 }' |
sort -k1nr |
uniq -c |
awk -v OFS="\t" 'BEGIN { print "occs", "count" } { print $2, $1 }'
For your sample data, it produces:
occs count
4 1
3 2
0 2
If you want to count occurrences of multi-character strings, just give the desired string as the separator, e.g., awk -F 'gd' ... or awk -F 'pp' ....
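For instance, against the sample file that should give something like:
$ awk -F 'gd' '{ print NF-1 }' samplefile | sort -k1nr | uniq -c | awk -v OFS="\t" 'BEGIN { print "occs", "count" } { print $2, $1 }'
occs count
1 2
0 3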
#!/usr/bin/env perl
use strict; use warnings;
my $seq = shift @ARGV;
die unless defined $seq;
my %freq;
while ( my $line = <> ) {
    last unless $line =~ /\S/;
    my $occurances = () = $line =~ /(\Q$seq\E)/g;
    $freq{ $occurances } += 1;
}
for my $occurances ( sort { $b <=> $a } keys %freq ) {
    print "$occurances:\t$freq{$occurances}\n";
}
If you want short, you can always use:
#!/usr/bin/env perl
$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>
;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f;
or, perl -e '$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f' inputfile, but now I am getting silly.
Pure Bash:
declare -a count
while read ; do
cnt=${REPLY//[^p]/} # remove non-p characters
((count[${#cnt}]++)) # use length as array index
done < "$infile"
for idx in ${!count[*]} # iterate over existing indices
do echo -e "$idx ${count[idx]}"
done | sort -nr
Output as desired:
4 1
3 2
0 2
You can do it in one gawk process (well, with a sort coprocess):
gawk -F p -v OFS='\t' '
{ count[NF-1]++ }
END {
    print "occs", "count"
    coproc = "sort -rn"
    for (n in count)
        print n, count[n] |& coproc
    close(coproc, "to")
    while ((coproc |& getline) > 0)
        print
    close(coproc)
}
'
Shortest solution so far:
perl -nE'say tr/p//' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
For multiple characters, use a regex pattern:
perl -ple'$_ = () = /pg/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
This one handles overlapping matches (e.g. it finds 3 "pp" in "pppp" instead of 2):
perl -ple'$_ = () = /(?=pp)/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
Original cryptic but short pure-Perl version:
perl -nE'
++$c{ () = /pg/g };
}{
say "occs\tcount";
say "$_\t$c{$_}" for sort { $b <=> $a } keys %c;
'
