2D histogram making - linux

I have a data file containing two columns, like
1.1 2.2
3.1 4.5
1.2 4.5
3.2 4.6
1.1 2.3
4.2 4.9
4.2 1.1
I would like to make a 2D histogram from the two columns, i.e. to get this output (if the step size, or bin size since we are talking about histogramming, equals 0.1 in this case):
1.0 1.0 0
1.0 1.1 0
1.0 1.2 0
...
1.1 1.0 0
1.1 1.1 0
1.1 1.2 0
...
1.1 2.0 0
1.1 2.1 0
1.1 2.2 1
...
...
Can anybody suggest something? It would be nice if I could set the range of values of the columns. In the above case the 1st column values go from 1 to 4, and the same for the second column.
EDITED: updated in order to handle more general data input, e.g. floating-point numbers. The step size in the above case is 0.1, but it would be nice if it were tunable, i.e. if the step size (bin size) could be set to, for example, 0.2 or 1.0.
If the step size is for example 1.0, then 1.1 and 1.8 fall into the same bin and have to be counted together, for example (with the range, let us say, 0.0 ... 4.0 for both columns):
1.1 1.8
2.5 2.6
1.4 2.1
1.3 1.5
3.3 4.0
3.8 3.9
4.0 3.2
4.0 4.0
output (if the bin size = 1.0)
1 1 2
1 2 1
1 3 0
1 4 0
2 1 0
2 2 1
2 3 0
2 4 0
3 1 0
3 2 0
3 3 1
3 4 1
4 1 0
4 2 0
4 3 1
4 4 1
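Whatever tool is used, the heart of a solution is mapping each value to a bin index, e.g. int(x/bs). A minimal awk sketch of just that idea (bs is the bin size; only non-empty bins are printed, in no particular order, and plain int() division can misbin borderline values when bs does not divide them exactly, due to floating point):
awk -v bs=1.0 '{ cnt[int($1/bs), int($2/bs)]++ }
END {
    for (k in cnt) {
        split(k, b, SUBSEP)            # recover the two bin indices
        print b[1]*bs, b[2]*bs, cnt[k]
    }
}' infile
The answers below also take care of enumerating the empty bins.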

This handles the original integer-valued input (the range is passed as l):
awk 'END {
    for (i = 0; ++i <= l;) {
        for (j = 0; ++j <= l;)
            printf "%d %d %d %s\n", i, j, \
                b[i, j], (j < l ? x : ORS)
    }
}
{
    f[NR] = $1; s[NR] = $2    # f and s are not actually used below
    b[$1, $2]++
}' l=4 infile
You may try this (not thoroughly tested):
awk -v l=4 -v bs=0.1 'BEGIN {
    if (!bs) {
        print "invalid bin size" > "/dev/stderr"
        exit
    }
    split(bs, t, ".")
    t[2] || fl++
    m = "%." length(t[2]) "f"
}
{
    fk = fl ? int($1) : sprintf(m, $1)
    sk = fl ? int($2) : sprintf(m, $2)
    f[fk]; s[sk]; b[fk, sk]++
}
END {
    if (!bs) exit 1
    for (i = 1; int(i) <= l; i += bs) {
        for (j = 1; int(j) <= l; j += bs) {
            if (fl) {
                fk = int(i); sk = int(j); m = "%d"
            }
            else {
                fk = sprintf(m, i); sk = sprintf(m, j)
            }
            printf "%s" m OFS m OFS "%d\n", (i > 1 && fk != p ? ORS : x), fk, sk, b[fk, sk]
            p = fk
        }
    }
}' infile
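To get the 1.0-wide bins of the second example, only the bs parameter changes, e.g. (assuming the program above is saved to hist2d.awk, a hypothetical name):
awk -v l=4 -v bs=1.0 -f hist2d.awk infile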

You can try this in bash:
for x in {1..4} ; do
    for y in {1..4} ; do
        echo $x%$y 0
    done
done \
    | join -1 1 -2 2 - -a1 <(sed 's/ /%/' FILE \
        | sort \
        | uniq -c \
        | sort -k2 ) \
    | sed 's/ 0 / /;s/%/ /'
It creates the table with all zeros in the last column, joins it with the real counts (the classic frequency table made with sort | uniq -c) and removes the zeros from the lines where a different count should be shown.
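For the sample data at the top, the intermediate frequency table produced by sed 's/ /%/' FILE | sort | uniq -c would look roughly like this (uniq -c left-pads the counts):
1 1.1%2.2
1 1.1%2.3
1 1.2%4.5
1 3.1%4.5
1 3.2%4.6
1 4.2%1.1
1 4.2%4.9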

One solution in perl (sample usage below):
#!/usr/bin/perl -W
use strict;

my ($min, $step, $max, $file) = @ARGV
    or die "Syntax: $0 <min> <step> <max> <file>\n";
my %seen;
open F, "$file"
    or die "Cannot open file $file: $!\n";
my @l = map { chomp; $_ } qx/seq $min $step $max/;
foreach my $first (@l) {
    foreach my $second (@l) {
        $seen{"$first $second"} = 0;
    }
}
foreach my $line (<F>) {
    chomp $line;
    $line or next;
    $seen{$line}++;
}
my $len = @l; # size of the list
my $i = 0;
foreach my $key (sort keys %seen) {
    printf("%s %d\n", $key, $seen{$key});
    $i++;
    print "\n" unless $i % $len;
}
exit(0);
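The usage would be, for example (assuming the script is saved as histogram.pl and made executable; note that the data lines must textually match seq's output format, e.g. "1.1 2.2", for the counts to register):
$ ./histogram.pl 1.0 0.1 4.0 infile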

Related

Compute sum for column 2 and average for all other columns in multiple files without considering missing values

I want to calculate the sum for column 2 and the average for all other columns from 15 files: ifile1.txt, ifile2.txt, ..., ifile15.txt. The number of columns and rows is the same in each file, but some of the values are missing. Part of the data looks like:
ifile1.txt ifile2.txt ifile3.txt
3 ? ? ? . 1 2 1 3 . 4 ? ? ? .
1 ? ? ? . 1 ? ? ? . 5 ? ? ? .
4 6 5 2 . 2 5 5 1 . 3 4 3 1 .
5 5 7 1 . 0 0 1 1 . 4 3 4 0 .
. . . . . . . . . . . . . . .
I would like to produce a new file which shows the sum for column 2 and the average for all other columns from these 15 files, without considering the missing values.
ofile.txt
2.66 2 1 3 . (i.e. average of 3 1 4, sum of ? 2 ?, average of ? 1 ?, average of ? 3 ?, and so on)
2.33 ? ? ? .
3 15 4.33 1.33 .
3 8 4 0.66 .
. . . . .
This question is similar to my earlier question Average of multiple files without considering missing values, where the script computed the average for all columns.
awk '
{
    for (i = 1; i <= NF; i++) {
        Sum[FNR,i] += $i
        Count[FNR,i] += $i != "?"    # count only the non-missing fields
    }
}
END {
    for (i = 1; i <= FNR; i++) {
        for (j = 1; j <= NF; j++) printf "%s ", Count[i,j] != 0 ? Sum[i,j]/Count[i,j] : "?"
        print ""
    }
}
' ifile*
But I wasn't able to modify it to get my desired output.
Based on your previous awk script, I modified it as follows:
$ cat awk_script
{
    for (i = 1; i <= NF; i++) {
        Sum[FNR,i] += $i
        Count[FNR,i] += $i != "?"
    }
}
END {
    for (i = 1; i <= FNR; i++) {
        for (j = 1; j <= NF; j++)
            if (j == 2) { printf "%s\t", Count[i,j] != 0 ? Sum[i,j] : "?" }
            else {
                if (Count[i,j] != 0) {
                    val = Sum[i,j]/Count[i,j]
                    printf "%s%s\t", int(val), match(val, /\.[0-9]/) != 0 ? "." substr(val, RSTART+1, 2) : ""
                } else printf "?\t"
            }
        print ""
    }
}
And the output would be:
$ awk -f awk_script ifile*
2.66 2 1 3 0
2.33 ? ? ? 0
3 15 4.33 1.33 0
3 8 4 0.66 0
0 0 0 0 0
Brief explanation:
if (j==2): print the sum of the values instead of the average
for the average values, notice that the expected output is truncated rather than rounded, so the decimal part is extracted using substr(val,RSTART+1,2) and the integer part using int(val)
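A standalone sketch (not part of the script above) showing that truncation idiom on the first cell's average, 8/3:
$ awk 'BEGIN { val = 8/3                  # average of 3, 1 and 4
    match(val, /\.[0-9]/)                 # locate the decimal point
    print int(val) (RSTART ? "." substr(val, RSTART+1, 2) : "") }'
2.66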
$ cat tst.awk
BEGIN { dfltVal="?"; OFS="\t" }
{
    for (colNr=1; colNr<=NF; colNr++) {
        if ($colNr != dfltVal) {
            sum[FNR,colNr] += $colNr
            cnt[FNR,colNr]++
        }
    }
}
END {
    for (rowNr=1; rowNr<=FNR; rowNr++) {
        for (colNr=1; colNr<=NF; colNr++) {
            val = dfltVal
            if ( cnt[rowNr,colNr] != 0 ) {
                # column 2 is divided by 1 (a sum), the others by their count (an average);
                # int(100*x)/100 truncates the result to 2 decimals
                val = int(100 * sum[rowNr,colNr] / (colNr==2 ? 1 : cnt[rowNr,colNr])) / 100
            }
            printf "%s%s", val, (colNr<NF ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file1 file2 file3
2.66 2 1 3
2.33 ? ? ?
3 15 4.33 1.33
3 8 4 0.66

Average of multiple files without considering missing values

I want to calculate the average of 15 files: ifile1.txt, ifile2.txt, ..., ifile15.txt. The number of columns and rows is the same in each file, but some of the values are missing. Part of the data looks like:
ifile1.txt ifile2.txt ifile3.txt
3 ? ? ? . 1 2 1 3 . 4 ? ? ? .
1 ? ? ? . 1 ? ? ? . 5 ? ? ? .
4 6 5 2 . 2 5 5 1 . 3 4 3 1 .
5 5 7 1 . 0 0 1 1 . 4 3 4 0 .
. . . . . . . . . . . . . . .
I would like to produce a new file which shows the average of these 15 files, without considering the missing values.
ofile.txt
2.66 2 1 3 . (i.e. average of 3 1 4, average of ? 2 ? and so on)
2.33 ? ? ? .
3 5 4.33 1.33 .
3 2.67 4 0.66 .
. . . . .
This question is similar to my earlier question Average of multiple files in shell where the script was
awk 'FNR == 1 { nfiles++; ncols = NF }
{ for (i = 1; i < NF; i++) sum[FNR,i] += $i
if (FNR > maxnr) maxnr = FNR
}
END {
for (line = 1; line <= maxnr; line++)
{
for (col = 1; col < ncols; col++)
printf " %f", sum[line,col]/nfiles;
printf "\n"
}
}' ifile*.txt
But I wasn't able to modify it.
Use this:
paste ifile*.txt | awk '{n=f=0; for(i=1;i<=NF;i++){if($i*1){f++;n+=$i}}; print n/f}'
paste shows all the files side by side
awk calculates the average per line:
n=f=0; set the variables to 0.
for(i=1;i<=NF;i++) loop through all the fields.
if($i*1) if the field evaluates to a non-zero number (multiplication by 1 yields non-zero).
f++;n+=$i increment f (the count of numeric fields) and add the field's value to n.
print n/f print the average.
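Since $i*1 is also false for a field that is exactly 0, zeros get skipped as well. A sketch of a variant that instead skips only the '?' marker (assumed from the sample data):
paste ifile*.txt | awk '{
    n = f = 0
    for (i = 1; i <= NF; i++)
        if ($i != "?") { f++; n += $i }   # skip only the missing-value marker
    print (f ? n/f : "?")                 # avoid dividing by zero
}'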
awk '
{
    for (i = 1; i <= NF; i++) {
        Sum[FNR,i] += $i
        Count[FNR,i] += $i != "?"
    }
}
END {
    for (i = 1; i <= FNR; i++) {
        for (j = 1; j <= NF; j++) printf "%s ", Count[i,j] != 0 ? Sum[i,j]/Count[i,j] : "?"
        print ""
    }
}
' ifile*
assuming the files are correctly formed (no trailing blank lines, ...)
awk 'FNR == 1 { nfiles++; ncols = NF }
{
    for (i = 1; i < NF; i++)
        if ($i != "?") { sum[FNR,i] += $i; count[FNR,i]++ }
    if (FNR > maxnr) maxnr = FNR
}
END {
    for (line = 1; line <= maxnr; line++) {
        for (col = 1; col < ncols; col++)
            if (count[line,col] > 0) printf " %f", sum[line,col]/count[line,col]
            else printf " ? "
        printf "\n"
    }
}' ifile*.txt
I just check for the '?' ...

Read all the entries of a matrix using shell script

I have a matrix,
A(i,j), i=1,m and j=1,n
I can read it in C and Fortran, but I can't read it in a shell script. I know this is a very simple question, but I am very new to shell scripting. I want to read all the entries and do some calculation, e.g. I have a matrix:
A= 1 0 1 1
2 1 0 2
1 0 0 3
1 2 3 0
Now I want to compare each 0 with the values above, below, left and right of it. Finally I want to do some computation (let's say a sum) with the four values around each zero. In the above example the result for the five zeros will be:
1st zero: 3
2nd zero: 4
3rd zero: 4
4th zero: 6
5th zero: 6
So in Fortran, I can do it by reading all the values as:
do j=1,n
  do i=1,m
    if (A(i,j) .eq. 0) then
      B(i,j)=A(i-1,j)+A(i+1,j)+A(i,j+1)+A(i,j-1)
    endif
  enddo
enddo
But I want to do it in a shell script. How can I do that?
Assuming that the data are given in "test.dat" (with no "A = "), I tried it anyway...
#!/bin/bash
inpfile="test.dat"
L=100 # some large value
for (( i = 0; i < L; i++ )) {
for (( j = 0; j < L; j++ )) {
A[ L * i + j ]=0
}
}
i=1
while read buf; do
inp=( $buf ); n=${#inp[@]}
if (( L <= n+1 )); then echo "L is too small"; exit -1; fi
for (( j = 1; j <= n; j++ )) {
A[ L * i + j ]=${inp[j-1]}
}
(( i++ ))
done < $inpfile
nzeros=0
for (( i = 1; i <= n; i++ )) {
for (( j = 1; j <= n; j++ )) {
if (( ${A[ L * i + j ]} == 0 )); then
(( nzeros++ ))
B[ nzeros ]=$(( \
${A[ L * (i-1) + j ]} + \
${A[ L * (i+1) + j ]} + \
${A[ L * i + j+1 ]} + \
${A[ L * i + j-1 ]} ))
fi
}
}
for (( k = 1; k <= nzeros; k++ )) {
    case $k in 1) suf=st ;; 2) suf=nd ;; 3) suf=rd ;; *) suf=th ;; esac
    printf "%d%s zero: %d\n" $k $suf ${B[k]}
}
Conclusion: Very painful. Fortran is recommended...(as expected)
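For comparison, a sketch of the same computation in awk, where the bookkeeping is far more natural (columns are scanned in the outer loop to match the Fortran ordering, and out-of-range neighbours evaluate to 0):
awk '{ for (j = 1; j <= NF; j++) a[NR, j] = $j }   # load the matrix
END {
    for (j = 1; j <= NF; j++)                      # column-major, like the Fortran loop
        for (i = 1; i <= NR; i++)
            if (a[i, j] == 0)
                printf "zero %d: %d\n", ++k,
                    a[i-1, j] + a[i+1, j] + a[i, j+1] + a[i, j-1]
}' test.dat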

Finding minimum and maximum values of the first column - grouped by the second column

I have lots of unsorted data in a text file, in the following form:
1.0 10
1.8 10
1.1 10
1.9 20
2.8 20
2.1 20
2.9 20
...
For each value in the second column, I want to get the interval of values in the first column. So for the example above, the result should be
1.0 1.8 10
1.9 2.9 20
How can I do this with C/C++, awk or other Linux shell tools?
You can use this awk:
awk '{
    if (!($2 in nmin) || $1 < nmin[$2])
        nmin[$2] = $1
    if (!($2 in nmax) || $1 > nmax[$2])
        nmax[$2] = $1
}
END {
    for (a in nmin)
        print nmin[a], nmax[a], a
}
' inFile
this one-liner should work for you:
awk '!($2 in i)||$1<i[$2]{i[$2]=$1} !($2 in a)||$1>a[$2]{a[$2]=$1} END{for(x in i)print i[x],a[x],x}' file
output:
1.0 1.8 10
1.9 2.9 20
I think this should work:
{ read vStart int && v=$vStart &&
  while read vNext nextInt; do
      if [ $int -ne $nextInt ]; then
          echo "$vStart $v $int"
          vStart=$vNext
      fi
      v=$vNext
      int=$nextInt
  done &&
  echo "$vStart $v $int"; }
To add another alternative, you could do this in R as well:
d.in <- read.table(file = commandArgs(trailingOnly = T)[1]);
write.table(
aggregate(V1 ~ V2, d.in, function (x) c(min(x),max(x)))[,c(2,1)]
, row.names = F
, col.names = F
, sep = "\t");
Then just call this script with Rscript:
$ Rscript script.R data.txt
1 1.8 10
1.9 2.9 20

reading a tuple from a file with awk

Hi, I need a script to read the number of eth interrupts from the /proc/interrupts file with awk and find the total number of interrupts per CPU core. And then I want to use them in bash. The content of the file is:
CPU0 CPU1 CPU2 CPU3
47: 33568 45958 46028 49191 PCI-MSI-edge eth0-rx-0
48: 0 0 0 0 PCI-MSI-edge eth0-tx-0
49: 1 0 1 0 PCI-MSI-edge eth0
50: 28217 42237 65203 39086 PCI-MSI-edge eth1-rx-0
51: 0 0 0 0 PCI-MSI-edge eth1-tx-0
52: 0 1 0 1 PCI-MSI-edge eth1
59: 114991 338765 77952 134850 PCI-MSI-edge eth4-rx-0
60: 429029 315813 710091 26714 PCI-MSI-edge eth4-tx-0
61: 5 2 1 5 PCI-MSI-edge eth4
62: 1647083 208840 1164288 933967 PCI-MSI-edge eth5-rx-0
63: 673787 1542662 195326 1329903 PCI-MSI-edge eth5-tx-0
64: 5 6 7 4 PCI-MSI-edge eth5
I am reading this file with awk in this code:
#!/bin/bash
FILE="/proc/interrupts"
output=$(awk 'NR==1 {
core_count = NF
print core_count
next
}
/eth/ {
for (i = 2; i < 2+core_count; i++)
totals[i-2] += $i
}
END {
for (i = 0; i < core_count; i++)
printf("%d\n", totals[i])
}
' $FILE)
core_count=$(echo $output | cut -d' ' -f1)
output=$(echo $output | sed 's/^[0-9]*//')
totals=(${output// / })
In this approach, I get the total core count and then the total interrupts per core, in order to sort them in my script. But I can only handle the numbers in the totals array like this:
totals[0]=22222
totals[1]=33333
But I need to handle them as tuples with the names of the CPU cores:
totals[0]=(CPU0,2222)
totals[1]=(CPU1,3333)
I think I must assign the names to an array and read them into bash as tuples in my sed. How can I achieve this?
First of all, there's no such thing as a 'tuple' in bash, and arrays are completely flat. This means that you either have a 'scalar' variable, or a one-level array of scalars.
There are a number of approaches to the task you're facing. Either:
If you're using a new enough bash (4.0 or later), you can use an associative array (hash, map, or however you call it). Then the CPU names will be keys and the numbers will be values;
Create a plain array (a flattened perl-like hash) where even indexes (0, 2, ...) hold the keys (CPU names) and odd ones the values;
Create two separate arrays, one with the CPU names and the other with the values;
Create just a single array, with CPU names separated from values by some symbol (i.e. = or :).
Let's cover approach 2 first:
#!/bin/bash
FILE="/proc/interrupts"
output=$(awk 'NR==1 {
core_count = NF
for (i = 1; i <= core_count; i++)
names[i-1] = $i
next
}
/eth/ {
for (i = 2; i < 2+core_count; i++)
totals[i-2] += $i
}
END {
for (i = 0; i < core_count; i++)
printf("%s %d\n", names[i], totals[i])
}
' ${FILE})
core_count=$(echo "${output}" | wc -l)
totals=(${output})
Note a few things I've changed to make the script simpler:
awk now outputs `cpu-name number', one per line, separated by a single space;
the core count is not output by awk (to avoid post-processing the output) but instead deduced from the number of lines in the output,
the totals array is created by flattening the output; both spaces and newlines will be treated as whitespace and used to separate the values.
The resulting array looks like:
totals=( CPU0 12345 CPU1 23456 ) # ...
To iterate over it, you could use something like (the simple way):
set -- "${totals[#}}"
while [[ $# -gt 0 ]]; do
cpuname=${1}
value=${2}
# ...
shift;shift
done
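Alternatively, a sketch of walking the same flat array by index pairs, which avoids clobbering the positional parameters:
for (( i = 0; i + 1 < ${#totals[@]}; i += 2 )); do
    cpuname=${totals[i]}
    value=${totals[i+1]}
    echo "${cpuname}: ${value}"    # or whatever per-CPU processing is needed
done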
Now let's modify it for approach 1:
#!/bin/bash
FILE="/proc/interrupts"
output=$(awk 'NR==1 {
core_count = NF
for (i = 1; i <= core_count; i++)
names[i-1] = $i
next
}
/eth/ {
for (i = 2; i < 2+core_count; i++)
totals[i-2] += $i
}
END {
for (i = 0; i < core_count; i++)
printf("[%s]=%d\n", names[i], totals[i])
}
' ${FILE})
core_count=$(echo "${output}" | wc -l)
declare -A totals
eval totals=( ${output} )
Note that:
the awk output format has been changed to suit the associative array semantics,
totals is declared as an associative array (declare -A),
sadly, eval must be used to let bash directly handle the output.
The resulting array looks like:
declare -A totals=( [CPU0]=12345 [CPU1]=23456 )
And now you can use:
echo ${totals[CPU0]}
for cpu in "${!totals[#]}"; do
echo "For CPU ${cpu}: ${totals[${cpu}]}"
done
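If you'd rather avoid eval, one alternative (a sketch, not from the original answer) is to keep the plain `name value' output format of approach 2 and fill the associative array with a read loop:
declare -A totals
while read -r name value; do
    totals[${name}]=${value}
done <<< "${output}"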
The third approach can be done a number of different ways. Assuming you can allow two reads of /proc/interrupts, you could even do:
FILE="/proc/interrupts"
output=$(awk 'NR==1 {
core_count = NF
next
}
/eth/ {
for (i = 2; i < 2+core_count; i++)
totals[i-2] += $i
}
END {
for (i = 0; i < core_count; i++)
printf("%d\n", totals[i])
}
' ${FILE})
core_count=$(echo "${output}" | wc -l)
names=( $(head -n 1 /proc/interrupts) )
totals=( ${output} )
So now the awk is once again only outputting the counts, and the names are obtained by bash from the first line of /proc/interrupts directly. Alternatively, you could create the split arrays from the single array obtained in approach 2, or parse the awk output some other way.
The result would be in two arrays:
names=( CPU0 CPU1 )
totals=( 12345 23456 )
And output:
for (( i = 0; i < core_count; i++ )); do
echo "${names[$i]} -> ${totals[$i]}"
done
And the last approach:
#!/bin/bash
FILE="/proc/interrupts"
output=$(awk 'NR==1 {
core_count = NF
for (i = 1; i <= core_count; i++)
names[i-1] = $i
next
}
/eth/ {
for (i = 2; i < 2+core_count; i++)
totals[i-2] += $i
}
END {
for (i = 0; i < core_count; i++)
printf("%s=%d\n", names[i], totals[i])
}
' ${FILE})
core_count=$(echo "${output}" | wc -l)
totals=( ${output} )
Now the (regular) array looks like:
totals=( CPU0=12345 CPU1=23456 )
And you can parse it like:
for x in "${totals[#]}"; do
name=${x%=*}
value=${x#*=}
echo "${name} -> ${value}"
done
(note now that splitting CPU name and value occurs in the loop).
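The two expansions used there are plain suffix/prefix stripping; a quick sketch:
x="CPU0=12345"
echo "${x%=*}"    # strip the shortest suffix matching '=*'  -> CPU0
echo "${x#*=}"    # strip the shortest prefix matching '*='  -> 12345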
