Subtracting N columns from two files with AWK - linux

I have two files, each with N columns:
File1:
A 1 2 3 ....... Na1
B 2 3 4 ....... Nb1
File2:
A 2 2 4 ....... Na2
B 1 3 4 ....... Nb2
I want an output where each column of File2 is subtracted from the corresponding column of File1, from the first numeric column through column N, as shown below:
A -1 0 -1 ........ (Na1-Na2)
B 1 0 0 ........ (Nb1-Nb2)
How can I do this with AWK or Perl scripting in a Linux environment?

This has already been answered, but I will add a one-liner. It uses paste to concatenate the files and awk to subtract:
paste file{1,2} | awk '{for (i=1;i<=NF/2;i++) printf "%s ", ($i==$i+0)?$i-$(i+NF/2):$i; print ""}'
Validation:
$ cat file1
A 1 2 3 4 5
B 2 3 4 5 6
$ cat file2
A 2 2 4 10 12
B 1 3 4 3 5
$ paste file{1,2} | awk '{for (i=1;i<=NF/2;i++) printf "%s ", ($i==$i+0)?$i-$(i+NF/2):$i; print ""}'
A -1 0 -1 -6 -7
B 1 0 0 2 1
It requires both files to have the same number of columns, and non-numeric columns should be at the same positions. For each field it prints the value from the first file if it is non-numeric, otherwise it prints the difference.

Try:
awk '{split($0,S); getline<f; for(i=2; i<=NF; i++) $i-=S[i]}1' OFS='\t' f=file1 file2
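With the question's sample rows, the output comes out tab-separated because of OFS='\t' and should look like:
A	-1	0	-1
B	1	0	0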

Here's one way using GNU awk. Run like:
awk -f script.awk File2 File1 | rev | column -t | rev
Contents of script.awk:
FNR==NR {
    for(i=2;i<=NF;i++) {
        a[$1][i]=$i
    }
    next
}
{
    for(j=2;j<=NF;j++) {
        $j-=a[$1][j]
    }
}1
Alternatively, here's the one-liner:
awk 'FNR==NR { for(i=2;i<=NF;i++) a[$1][i]=$i; next } { for(j=2;j<=NF;j++) $j-=a[$1][j] }1' File2 File1 | rev | column -t | rev
Results:
A -1 0 -1
B 1 0 0

awk 'FNR==NR{for(i=2;i<=NF;i++)a[FNR"-"i]=$i;next}{printf "\n"$1" ";for(i=2;i<=NF;i++){printf $i-a[FNR"-"i]" "}}' file1 file2
> cat file1
A 1 2 3
B 2 3 4
> cat file2
A 2 2 4
B 1 3 4
> awk 'FNR==NR{for(i=2;i<=NF;i++)a[FNR"-"i]=$i;next}{printf "\n"$1" ";for(i=2;i<=NF;i++){printf $i-a[FNR"-"i]" "}}' file1 file2
A 1 0 1
B -1 0 0
>
Alternatively, put this in a file:
#!/usr/bin/awk -f
FNR==NR{
    for(i=2;i<=NF;i++)
        a[FNR"-"i]=$i
    next
}
{
    printf "\n"$1" "
    for(i=2;i<=NF;i++)
    {
        printf $i-a[FNR"-"i]" "
    }
}
and execute as:
awk -f file.awk file1 file2

Something like this:
use strict;
use warnings;
my (@fh, @v);
for (@ARGV) {
    open (my $handle, "<", $_) or die ("$!: $_");
    push @fh, $handle;
}
while (@v = map { [split ' ', <$_> ] } @fh and defined shift @{$v[0]}) {
    print join(" ", (shift @{$v[1]}, map { $_ - shift(@{$v[1]}) } @{$v[0]})), "\n";
}
close $_ for (@fh);
To run:
perl script.pl input1 input2
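Given the sample rows from the question (A 1 2 3 vs. A 2 2 4, etc.), this should print:
A -1 0 -1
B 1 0 0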

Something like this perhaps? I'm afraid I can't test this code as I have no PC to hand at present.
This program expects the names of the two files as parameters on the command line, and outputs the results to STDOUT.
use strict;
use warnings;
use autodie;
my @fh;
for my $filename (@ARGV) {
    open my $fh, '<', $filename;
    push @fh, $fh;
}
until (grep eof $_, @fh) {
    my @records;
    for my $fh (@fh) {
        my $line = <$fh>;
        chomp $line;
        push @records, [ split ' ', $line ];
    }
    $records[0][$_] -= $records[1][$_] for 1 .. $#{$records[0]};
    print "@{$records[0]}\n";
}
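Assuming it is saved as, say, subtract.pl (a name used here only for illustration), it would be run as:
perl subtract.pl File1 File2
and, if all is well, it should print the question's sample result:
A -1 0 -1
B 1 0 0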

Related

Print the input search string if grep doesn't match

I have file1
BOB
JOHN
SALLY
I have file2
There was a boy called JOHN and he was playing with FRED while
JILL went off to find a bucket of water from TOM but she
fell down the hill.
I want to iterate through the file1 words and search for these in file2.
I want to print the words that are NOT found in file2.
So the output would be
BOB
SALLY
In other words, if the grep fails to match, I'd like to print the string that grep was searching for.
I'm starting here:
grep -o -f file1 file2
But of course, this returns
JOHN
How would I get the original search strings that didn't match - to print instead?
Here is a grep one liner to get this done:
grep -vxFf <(tr '[[:blank:]]' '\n' < file2) file1
BOB
SALLY
tr converts spaces/tabs to newlines first, then grep -vxFf finds the words in file1 that do not match.
Or as David suggested in comments below:
grep -vxFf <(printf '%s\n' $(<file2)) file1
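This variant should give the same result with the sample files:
BOB
SALLY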
With your shown samples, you could try the following.
awk '
FNR==NR{
    arr[$0]
    next
}
{
    for(i in arr){
        if(index($0,i)){
            delete arr[i]
            next
        }
    }
}
END{
    for(i in arr){
        print i
    }
}
' file1 file2
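With the shown samples this should print the unmatched names; note that for (i in arr) does not guarantee any particular order:
BOB
SALLY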
If the order isn't critical, you can use:
awk '
FNR == NR { a[$1]=0; next }
{ for (i=1;i<=NF;i++)
if ($i in a)
a[$i]++
}
END {
for (i in a)
if (!a[i])
print i
}
' file1 file2
Example Use/Output
$ awk '
> FNR == NR { a[$1]=0; next }
> { for (i=1;i<=NF;i++)
> if ($i in a)
> a[$i]++
> }
> END {
> for (i in a)
> if (!a[i])
> print i
> }
> ' file1 file2
SALLY
BOB

Convert number from text file

I have a file:
id name date
1 paul 23.07
2 john 43.54
3 marie 23.4
4 alan 32.54
5 patrick 32.1
I want to print the names that start with "p" and have an odd-numbered id.
My command:
grep "^p" filename | cut -d ' ' -f 2 | ....
result:
paul
patrick
Awk can do it all:
$ awk 'NR > 1 && $2 ~ /^p/ && ($1 % 2) == 1 { print $2 }' op.txt
paul
patrick
EDIT
To use : as the field separator:
$ awk -F: 'NR > 1 && $2 ~ /^p/ && ($1 % 2) == 1 { print $2 }' op.txt
NR > 1 - skip the header
$2 ~ /^p/ - name field starts with p
$1 % 2 == 1 - ID field is odd
{ print $2 } - if all of the above are true, print the name field
How about a little awk?
awk '{if ($1 % 2 == 1 && substr($2, 1, 1) == "p") print $2}' filename
In awk the fields are split by spaces, tabs and newlines by default, so your id is available as $1, the name as $2, etc. The if is quite self-explanatory: when the condition is true, the name is printed out; otherwise nothing is done. AWK and its syntax are far more friendly than people usually think.
Just remember the basic pattern:
BEGIN {
    # run once at the beginning
}
{
    # done for each line
}
END {
    # run once at the end
}
If you need a more complex parsing, you can keep the script clear and readable in a separate file and call it like this:
awk -f script.awk filename
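For example, a sketch of script.awk for this question, following that template (the script name is just the one used above):
BEGIN {
    # run once at the beginning: nothing to set up for this task
}
{
    # done for each line: print names starting with "p" that have an odd id
    if ($1 % 2 == 1 && substr($2, 1, 1) == "p")
        print $2
}
END {
    # run once at the end: nothing to summarize here
}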
You might try this
grep -e "[0-9]*[13579]\s\+p[a-z]\+" -o text | tr -s ' ' | cut -d ' ' -f 2
An odd number is easily matched by a regex, which here we write as
[0-9]*[13579]
If you run this command with a sample file named text
file: text
id name date
1 paul 23.07
2 john 43.54
3 marie 23.4
5 patrick 32.1
38 peter 21.44
10019 peyton 12.02
you will get this output:
paul
patrick
peyton
Note that tr -s ' ' is used to make sure that the delimiter is always a single space.

Random selection of columns using linux command

I have a flat file (.txt) with 606,347 columns, and I want to extract 50,000 RANDOM columns, with the exception of the first column, which is the sample identification. How can I do that using Linux commands?
My file looks like:
ID SNP1 SNP2 SNP3
1 0 0 2
2 1 0 2
3 2 0 1
4 1 1 2
5 2 1 0
It is TAB delimited.
Thank you so much.
Cheers,
Paula.
awk to the rescue!
$ cat shuffle.awk
function shuffle(a,n,k) {
    for(i=1;i<=k;i++) {
        j=int(rand()*(n-i))+i
        if(j in a) a[i]=a[j]
        else a[i]=j
        a[j]=i;
    }
}
BEGIN {srand()}
NR==1 {shuffle(ar,NF,ncols)}
{for(i=1;i<=ncols;i++) printf "%s", $(ar[i]) FS; print ""}
General usage:
$ echo $(seq 5) | awk -f shuffle.awk -v ncols=5
3 4 1 5 2
In your special case you can print $1 and start the function loop from 2, i.e. change
for(i=1;i<=k;i++) to a[1]=1; for(i=2;i<=k;i++)
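Applying that change, the whole script might look like this (a sketch; a[1]=1 pins the ID column to position 1, so ncols has to count it along with the random columns):
function shuffle(a,n,k) {
    a[1]=1                     # always keep the ID column in slot 1
    for(i=2;i<=k;i++) {
        j=int(rand()*(n-i))+i
        if(j in a) a[i]=a[j]
        else a[i]=j
        a[j]=i;
    }
}
BEGIN {srand()}
NR==1 {shuffle(ar,NF,ncols)}
{for(i=1;i<=ncols;i++) printf "%s", $(ar[i]) FS; print ""}
For 50,000 random columns plus the ID it would be invoked as something like awk -f shuffle.awk -v ncols=50001 file.txt.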
Try this:
echo {2..606347} | tr ' ' '\n' | shuf | head -n 50000 | xargs -d '\n' | tr ' ' ',' | xargs -I {} cut -d $'\t' -f {} file
Update:
echo {2..606347} | tr ' ' '\n' | shuf | head -n 50000 | sed 's/.*/&p/' | sed -nf - <(tr '\t' '\n' <file) | tr '\n' '\t'
@karakfa's answer is great, but the NF value can't be obtained in the BEGIN{} part of the awk script. Refer to: How to get number of fields in AWK prior to processing
I edited the code as:
head -4 10X.txt | awk '
function shuffle(a,n,k){
for(i=1;i<=k;i++) {
j=int(rand()*(n-i))+i
if(j in a) a[i]=a[j]
else a[i]=j
a[j]=i;
}
}
BEGIN{
FS=" ";OFS="\t"; ncols=10;
}NR==1{shuffle(tmp_array,NF,ncols);
for(i=1;i<=ncols;i++){
printf "%s", $(tmp_array[i]) OFS;
}
print "";
}NR>1{
printf "%s", $1 OFS;
for(i=1;i<=ncols;i++){
printf "%s", $(tmp_array[i]+1) OFS;
}
print "";
}'
Because I am processing single-cell gene expression profiles, the first column contains gene names from the second row onward.
My output is:
D4-2_3095 D6-1_3010 D16-2i_1172 D4-1_337 iPSCs-2i_227 D4-2_170 D12-serum_1742 D4-1_1747 D10-2-2i_1373 D4-1_320
Sox17 0 0 0 0 0 0 0 0 0 0
Mrpl15 0.987862442831866 1.29176904082314 2.12650693025845 0 1.33257747910871 0 1.58815046312948 1.18541326956528 1.12103842107813 0.656789854017254
Lypla1 0 1.29176904082314 0 0 0.443505832809852 0.780385141793088 0.57601629238987 0 0 0.656789854017254

Print line numbers of duplicate entries

I have a file in the following format:
ABRA CADABRA
ABRA CADABRA
boys
girls
meds toys
I'd like to have the line number returned of any duplicate lines, so the results would look like the following:
1
2
I'd prefer a short one-line command with linux tools. I've tried experimenting with awk and sed but have not had success as of yet.
This would work:
nl file.txt | uniq -f 1 -D | cut -f 1
nl prepends a line number to each line
uniq finds duplicates
-f 1 ignores the first field, i.e., the line number
-D prints (only) the lines that are duplicate
cut -f 1 shows only the first field (the line number)
With a combination of sort, uniq, and awk you can use this series of commands.
sort File_Name | uniq -c | awk '{print $2}'
Here:
uniq -d < $file | while read line; do grep -hn "$line" $file; done
Do this:
perl -e 'my $l = 0; while (<STDIN>) { chomp; $l++; if (exists $f{$_}) { if ($f{$_}->[0]++ == 1) { print "$f{$_}->[1]\n"; print "$l\n"; } } else { $f{$_} = [1,$l]; } }' < FILE
Ugly, but works for unsorted files.
$ cat in.txt
ABRA CADABRA
ABRA CADABRA
boys
girls
meds toys
girls
$ perl -e 'my $l = 0; while (<STDIN>) { chomp; $l++; if (exists $f{$_}) { if ($f{$_}->[0]++ == 1) { print "$f{$_}->[1]\n"; print "$l\n"; } } else { $f{$_} = [1,$l]; } }' < in.txt
1
2
4
6
$
EDIT: Actually it can be shortened slightly:
perl -ne '$l++; if (exists $f{$_}) { if ($f{$_}->[0]++ == 1) { print "$f{$_}->[1]\n"; print "$l\n"; } } else { $f{$_} = [1,$l]; }' < in.txt
To get all "different" duplicates in all lines, you can try:
nl input.txt | sort -k 2 | uniq -D -f 1 | sort -n
This will give you not only the line numbers but also the duplicated content found on those lines. Omit the last sort to get the duplicates grouped together.
Also try running:
nl input.txt | sort -k 2 | uniq --all-repeated=separate -f 1
This will group the various duplicates by adding an empty line between groups of duplicates.
Pipe the results through
| cut -f 1 | sed 's/ \+//g'
to get only the line numbers.
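Putting those pieces together, the whole pipeline for line numbers only would be:
nl input.txt | sort -k 2 | uniq -D -f 1 | sort -n | cut -f 1 | sed 's/ \+//g'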
$ awk '{a[$0]=($0 in a ? a[$0] ORS : "") NR} END{for (i in a) if (a[i]~ORS) print a[i]}' file
1
2

unix - breakdown of how many lines with number of character occurrences

Is there an inbuilt command to do this or has anyone had any luck with a script that does it?
I am looking to get counts of how many lines had how many occurrences of a specific character, sorted descending by the number of occurrences.
For example, with this sample file:
gkdjpgfdpgdp
fdkj
pgdppp
ppp
gfjkl
Suggested input (for the 'p' character)
bash/perl some_script_name "p" samplefile
Desired output:
occs count
4 1
3 2
0 2
Update:
How would you write a solution that works on a 2-character string such as 'gd', not just a single character such as 'p'?
$ sed 's/[^p]//g' input.txt | awk '{print length}' | sort -nr | uniq -c | awk 'BEGIN{print "occs", "count"}{print $2,$1}' | column -t
occs count
4 1
3 2
0 2
You could give the desired character as the field separator for awk, and do this:
awk -F 'p' '{ print NF-1 }' |
sort -k1nr |
uniq -c |
awk -v OFS="\t" 'BEGIN { print "occs", "count" } { print $2, $1 }'
For your sample data, it produces:
occs count
4 1
3 2
0 2
If you want to count occurrences of multi-character strings, just give the desired string as the separator, e.g., awk -F 'gd' ... or awk -F 'pp' ....
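For instance, counting the 'gd' string from the question's update against the sample input (assumed here to be in a file named samplefile) would look like:
awk -F 'gd' '{ print NF-1 }' samplefile |
sort -k1nr |
uniq -c |
awk -v OFS="\t" 'BEGIN { print "occs", "count" } { print $2, $1 }'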
#!/usr/bin/env perl
use strict; use warnings;
my $seq = shift @ARGV;
die unless defined $seq;
my %freq;
while ( my $line = <> ) {
last unless $line =~ /\S/;
my $occurances = () = $line =~ /(\Q$seq\E)/g;
$freq{ $occurances } += 1;
}
for my $occurances ( sort { $b <=> $a} keys %freq ) {
print "$occurances:\t$freq{$occurances}\n";
}
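Saved as, say, count_occ.pl (a name used here only for illustration), it would be run as the question suggests:
perl count_occ.pl p samplefile
and for the sample input it should print something like:
4:	1
3:	2
0:	2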
If you want short, you can always use:
#!/usr/bin/env perl
$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>
;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f;
or, perl -e '$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f' inputfile, but now I am getting silly.
Pure Bash:
declare -a count
while read ; do
    cnt=${REPLY//[^p]/}      # remove non-p characters
    ((count[${#cnt}]++))     # use length as array index
done < "$infile"
for idx in ${!count[*]}      # iterate over existing indices
do
    echo -e "$idx ${count[idx]}"
done | sort -nr
Output as desired:
4 1
3 2
0 2
You can do it in one gawk process (well, with a sort coprocess):
gawk -F p -v OFS='\t' '
{ count[NF-1]++ }
END {
    print "occs", "count"
    coproc = "sort -rn"
    for (n in count)
        print n, count[n] |& coproc
    close(coproc, "to")
    while ((coproc |& getline) > 0)
        print
    close(coproc)
}
'
Shortest solution so far:
perl -nE'say tr/p//' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
For multiple characters, use a regex pattern:
perl -ple'$_ = () = /pg/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
This one handles overlapping matches (e.g. it finds 3 "pp" in "pppp" instead of 2):
perl -ple'$_ = () = /(?=pp)/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
Original cryptic but short pure-Perl version:
perl -nE'
++$c{ () = /pg/g };
}{
say "occs\tcount";
say "$_\t$c{$_}" for sort { $b <=> $a } keys %c;
'
