Count occurrences of character per line/field on Unix - linux

Given a file with data like this (i.e. a stores.dat file):
sid|storeNo|latitude|longitude
2tt|1|-28.0372000t0|153.42921670
9|2t|-33tt.85t09t0000|15t1.03274200
What is the command that would return the number of occurrences of the 't' character per line?
e.g. it would return:
count lineNum
4 1
3 2
6 3
Also, to count occurrences per field, what is the command to return the following results?
e.g. for input of column 2 and character 't':
count lineNum
1 1
0 2
1 3
e.g. for input of column 3 and character 't':
count lineNum
2 1
1 2
4 3

To count occurrences of a character per line you can do:
awk -F'|' 'BEGIN{print "count", "lineNum"}{print gsub(/t/,"") "\t" NR}' file
count lineNum
4 1
3 2
6 3
To count occurrences of a character per field/column you can do:
column 2:
awk -F'|' -v fld=2 'BEGIN{print "count", "lineNum"}{print gsub(/t/,"",$fld) "\t" NR}' file
count lineNum
1 1
0 2
1 3
column 3:
awk -F'|' -v fld=3 'BEGIN{print "count", "lineNum"}{print gsub(/t/,"",$fld) "\t" NR}' file
count lineNum
2 1
1 2
4 3
The gsub() function returns the number of substitutions it made, so we print that as the count.
NR holds the current line number, so we print it as lineNum.
To count occurrences in a particular field, we pass the field number in the variable fld and run gsub() on $fld only.
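The same idea can be wrapped in a small shell function so that the character and the field are both parameters. This is only a sketch: the name count_char and its argument order are made up for illustration, and it uses index() instead of gsub() so that the character is matched literally rather than as a regular expression.

```shell
# Hypothetical helper (not part of the original answer).
# usage: count_char CHAR FIELD FILE    (FIELD=0 means the whole line)
count_char() {
    awk -F'|' -v ch="$1" -v fld="$2" '
        BEGIN {
            if (ch == "") exit 1          # an empty search string would never advance
            print "count", "lineNum"
        }
        {
            s = $fld                      # $0 when fld is 0, otherwise that field
            n = 0
            # index() does a literal search, so regex metacharacters in ch are safe
            while ((p = index(s, ch)) > 0) {
                n++
                s = substr(s, p + length(ch))
            }
            print n, NR
        }' "$3"
}
```

With the sample data, count_char t 0 stores.dat gives the per-line counts and count_char t 3 stores.dat gives the counts for column 3, separated by a space rather than a tab.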

grep -n -o "t" stores.dat | sort -n | uniq -c | cut -d : -f 1
gives almost exactly the output you want:
4 1
3 2
6 3
Thanks to @raghav-bhushan for the grep -o hint, what a useful flag. The -n flag includes the line number as well.

To count occurrences of a character per line:
$ awk -F 't' '{print NF-1, NR}' input.txt
4 1
3 2
6 3
This sets the field separator to the character that needs to be counted, then uses the fact that the number of fields is one greater than the number of separators.
To count occurrences in a particular column, cut out that column first:
$ cut -d '|' -f 2 input.txt | awk -F 't' '{print NF-1, NR}'
1 1
0 2
1 3
$ cut -d '|' -f 3 input.txt | awk -F 't' '{print NF-1, NR}'
2 1
1 2
4 3

One possible solution using perl:
Content of script.pl:
use warnings;
use strict;
## Check arguments:
## 1.- Input file
## 2.- Char to search.
## 3.- (Optional) field to search. If blank, zero or bigger than number
## of columns, default to search char in all the line.
(@ARGV == 2 || @ARGV == 3) or die qq(Usage: perl $0 input-file char [column]\n);
my ($char,$column);
## Get values or arguments.
if ( @ARGV == 3 ) {
($char, $column) = splice @ARGV, -2;
} else {
$char = pop @ARGV;
$column = 0;
}
## Check that $char must be a non-white space character and $column
## only accept numbers.
die qq[Bad input\n] if $char !~ m/^\S$/ or $column !~ m/^\d+$/;
print qq[count\tlineNum\n];
while ( <> ) {
## Remove last '\n'
chomp;
## Get fields.
my @f = split /\|/;
## If column is a valid one, select it to the search.
if ( $column > 0 and $column <= scalar @f ) {
$_ = $f[ $column - 1];
}
## Count.
my $count = eval qq[tr/$char/$char/];
## Print result.
printf qq[%d\t%d\n], $count, $.;
}
The script accepts up to three parameters:
Input file
Char to search
Column to search (optional): if it is missing, zero or bigger than the number of columns, the whole line is searched.
Running the script without arguments:
perl script.pl
Usage: perl script.pl input-file char [column]
With arguments and its output:
Here 0 is not a valid column, so it searches the whole line.
perl script.pl stores.dat 't' 0
count lineNum
4 1
3 2
6 3
Here it searches in column 1.
perl script.pl stores.dat 't' 1
count lineNum
0 1
2 2
0 3
Here it searches in column 3.
perl script.pl stores.dat 't' 3
count lineNum
2 1
1 2
4 3
'th' is not a single character:
perl script.pl stores.dat 'th' 3
Bad input

No need for awk or perl, only with bash and standard Unix utilities:
cat file | tr -c -d "t\n" | cat -n |
{ echo "count lineNum"
while read num data; do
test ${#data} -gt 0 && printf "%4d %5d\n" ${#data} $num
done; }
And for a particular column:
cut -d "|" -f 2 file | tr -c -d "t\n" | cat -n |
{ echo -e "count lineNum"
while read num data; do
test ${#data} -gt 0 && printf "%4d %5d\n" ${#data} $num
done; }
And we can even avoid tr and the cats:
echo "count lineNum"
num=1
while read data; do
new_data=${data//t/}
count=$((${#data}-${#new_data}))
test $count -gt 0 && printf "%4d %5d\n" $count $num
num=$(($num+1))
done < file
and even the cut:
echo "count lineNum"
num=1
while IFS="|" read -r -a array_data; do
data=${array_data[1]}
new_data=${data//t/}
count=$((${#data}-${#new_data}))
test $count -gt 0 && printf "%4d %5d\n" $count $num
num=$(($num+1))
done < file

awk '{gsub("[^t]",""); print length($0),NR;}' stores.dat
The call to gsub() deletes everything in the line that is not a t, then just print the length of what remains, and the current line number.
Want to do it just for column 2?
awk 'BEGIN{FS="|"} {gsub("[^t]","",$2); print length($2),NR;}' stores.dat

You could also split the line or field with "t" and check the length of the resulting array - 1. Set the col variable to 0 for the line or 1 through 3 for columns:
awk -F'|' -v col=0 -v OFS=$'\t' 'BEGIN {
print "count", "lineNum"
}{
split($col, a, "t"); print length(a) - 1, NR
}
' stores.dat

$ cat -n test.txt
1 test 1
2 you want
3 void
4 you don't want
5 ttttttttttt
6 t t t t t t
$ awk '{n=split($0,c,"t")-1;if (n!=0) print n,NR}' test.txt
2 1
1 2
2 4
11 5
6 6

awk 'BEGIN {FS = "|"} {print $1}' stores.dat | awk 'BEGIN {FS = "t"} {print NF - 1}'
Here $1 is the column you want to count in; the second awk splits that column on the character 't', so the count is the number of fields minus one.

perl -e 'while(<>) { $count = tr/t//; print "$count ".++$x."\n"; }' stores.dat
Another perl answer yay! The tr/t// function returns the count of the number of times the translation occurred on that line, in other words the number of times tr found the character 't'. ++$x maintains the line number count.

Related

How do I turn a text file with a single column into a matrix?

I have a text file that has a single column of numbers, like this:
1
2
3
4
5
6
I want to convert it into two columns, in the left to right order this way:
1 2
3 4
5 6
I can do it with:
awk '{print>"line-"NR%2}' file
paste line-0 line-1 >newfile
But I think the reliance on two intermediate files will make it fragile in a script.
I'd like to use something like cat file | mystery-zip-command >newfile
You can use paste to do this:
paste -d " " - - < file > newfile
You can also use pr:
pr -ats" " -2 file > newfile
-a - use round robin order
-t - suppress header and trailer
-s " " - use single space as the delimiter
-2 - two column output
See also:
Convert a text file into columns
another alternative
$ seq 6 | xargs -n2
1 2
3 4
5 6
or with awk
$ seq 6 | awk '{ORS=NR%2?FS:RS}1'
1 2
3 4
5 6
If you want the output to end with a newline even when the number of input lines is odd:
$ seq 7 | awk '{ORS=NR%2?FS:RS}1; END{if (NR%2) printf "\n"}'
1 2
3 4
5 6
7
awk 'NR % 2 == 1 { printf("%s", $1) }
NR % 2 == 0 { printf(" %s\n", $1) }
END { if (NR % 2 == 1) print "" }' file
The odd lines are printed with no newline after them, to print the first column. The even lines are printed with a space first and a newline after, to print the second column. At the end, if there were an odd number of lines, we print a newline so we don't end in the middle of the line.
With bash:
while IFS= read -r odd; do IFS= read -r even; echo "$odd $even"; done < file
Output:
1 2
3 4
5 6
$ seq 6 | awk '{ORS=(NR%2?FS:RS); print} END{if (ORS==FS) printf RS}'
1 2
3 4
5 6
$
$ seq 7 | awk '{ORS=(NR%2?FS:RS); print} END{if (ORS==FS) printf RS}'
1 2
3 4
5 6
7
$
Note that it always adds a terminating newline - that is important as future commands might depend on it, e.g.:
$ seq 6 | awk '{ORS=(NR%2?FS:RS); print}' | wc -l
3
$ seq 7 | awk '{ORS=(NR%2?FS:RS); print}' | wc -l
3
$ seq 7 | awk '{ORS=(NR%2?FS:RS); print} END{if (ORS==FS) printf RS}' | wc -l
4
Just change the single occurrence of 2 to 3 or however many columns you want if your requirements change:
$ seq 6 | awk '{ORS=(NR%3?FS:RS); print} END{if (ORS==FS) printf RS}'
1 2 3
4 5 6
$ seq 7 | awk '{ORS=(NR%3?FS:RS); print} END{if (ORS==FS) printf RS}'
1 2 3
4 5 6
7
$ seq 8 | awk '{ORS=(NR%3?FS:RS); print} END{if (ORS==FS) printf RS}'
1 2 3
4 5 6
7 8
$ seq 9 | awk '{ORS=(NR%3?FS:RS); print} END{if (ORS==FS) printf RS}'
1 2 3
4 5 6
7 8 9
$
Short awk approach:
awk '{print ( ((getline nl) > 0)? $0" "nl : $0 )}' file
The output:
1 2
3 4
5 6
(getline nl)>0 - getline reads the next record and assigns it to the variable nl. It returns 1 if it finds a record, 0 at the end of the file, and -1 on error, hence the > 0 test.
Short GNU sed approach:
sed 'N;s/\n/ /' file
N - add a newline to the pattern space, then append the next line of input to the pattern space
s/\n/ / - replace newline with whitespace within captured pattern space
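On an input with an odd number of lines the N command has no next line to read. GNU sed still prints the pending line in that case, but POSIX allows a sed to quit without printing it. A hedged variant that does not rely on that behaviour guards N with $! so it is skipped on the last line (the function name join_pairs is hypothetical):

```shell
# Hypothetical wrapper: join every two input lines with a single space.
# $!N means "run N on every line except the last", so an unpaired
# final line is printed unchanged instead of depending on GNU behaviour.
join_pairs() {
    sed '$!N;s/\n/ /' "$@"
}
```

It reads from the named files or from standard input, e.g. join_pairs file > newfile.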
seq 6 | tr '\n' ' ' | sed -r 's/([^ ]* [^ ]* )/\1\n/g'

Sum all the numbers in a file given by positional parameter

I want to sum all the numbers in a file (columns and lines) given by the first parameter, but my program shows sum=sum+$i instead of the numeric sum:
sum=0;
file=$1
for i in $file
do
sum=sum+$i;
done;
echo "The sum is: " $sum
Input file:
$ cat file.txt
10 20 10
40
50
Expected output :
The sum is: 130
Is there perhaps an awk method to solve this?
Try this -
$ cat file1.txt
10 20 10
40
50
$ awk '{for(i=1;i<=NF;i++) {sum+=$i}} END {print sum}' file1.txt
130
OR
$ xargs < file1.txt | tr ' ' + | bc
130
cat file.txt | xargs | sed -e 's/\ /+/g' | bc
You can also use a simple read and an array to sum the value relying on word splitting to separate the values into an array via the default IFS (Internal Field Separator), e.g.
#!/bin/bash
declare -i sum=0
fn="${1:-/dev/stdin}" ## read from file as 1st argument (default stdin)
while read -r line; do ## read each line
a=( $line ) ## separate values into array
for i in ${a[@]}; do ## for each value in array
((sum += i)) ## add to sum
done
done <"$fn"
echo "sum: $sum"
Example Input File
$ cat dat/numfile.txt
10 20 10
40
50
Example Use/Output
$ bash sumnumfile.sh dat/numfile.txt
sum: 130
Another for some awks (at least mawk and gawk):
$ awk -v RS="[^0-9]" '{s+=$1}END{print s}' file
130
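For completeness, the loop in the question prints text because sum=sum+$i is a plain string assignment, and for i in $file iterates over the file name rather than over the file contents. A minimal corrected sketch, assuming the file holds only whitespace-separated integers (the function name sum_file is made up here):

```shell
# Hypothetical fix of the loop from the question.
# Assumption: the file contains only integers separated by spaces or newlines.
sum_file() {
    sum=0
    # $(cat ...) is left unquoted on purpose so the shell splits it into words
    for i in $(cat "$1"); do
        sum=$((sum + i))      # arithmetic expansion instead of string assignment
    done
    echo "The sum is: $sum"
}
```

Called as sum_file file.txt, it adds up every number in every column and line.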

Linux command (Calculating the sum)

I have a .txt file with the following content:
a 3
a 4
a 5
a 6
b 1
b 3
b 5
c 9
c 10
I am wondering if there is any command (no awk if possible) that can read the .txt file and give the following output (Sorted by the second column):
c 19
a 18
b 9
You can use awk piped to sort:
awk '{sums[$1] += $2} END {for (i in sums) print i, sums[i]}' file | sort -rnk2
c 19
a 18
b 9
sums[$1] += $2 is adding value of $2 in an array sums that is indexed by field #1 ($1).
sort -rnk2 is reverse sorting numerically output of awk on field 2
You can use this code:
cat 1.txt | awk '{arr[$1]+=$2}END{for (var in arr) print var," ",arr[var]}' | sort -rnk 2
Explanation:
cat 1.txt - reads the content of the file 1.txt
awk - a language that is very useful for data manipulation
{arr[$1]+=$2} - for each line, adds the value of the second field to the array item whose key is the first field. The default field separator is whitespace.
END{for (var in arr) print var," ",arr[var]} - after all lines have been processed, prints the array content
sort -rnk 2 - reverse numeric sort on field 2
Non-awk solutions.
perl
perl -lane '
$sum{$F[0]} += $F[1]
} END {
$, = " ";
print $_, $sum{$_} for reverse sort {$sum{$a} <=> $sum{$b}} keys %sum
' file.txt
bash version 4
declare -A sum
while read key val; do (( sum[$key] += $val )); done < file.txt
for key in "${!sum[@]}"; do echo "$key ${sum[$key]}"; done | sort -rn -k2
non-awk challenge accepted
vars=$(cut -d" " -f1 nums | uniq); paste <(echo "$vars") <(cat <(sed -e 's/ /+=/' nums) <(echo "$vars" | sed 's/$/;/') | bc) | sort -k2,2nr
c 19
a 18
b 9
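The grouping idea used by the answers above (keep one running total per key) can also be sketched in plain POSIX shell without awk, perl or bash 4 arrays. The sketch sorts by key first so that equal keys are adjacent and a single running total is enough. The function name sum_by_key is hypothetical, and the input is assumed to have exactly two whitespace-separated columns with an integer in the second:

```shell
# Hypothetical helper: sum column 2 per key in column 1, largest sum first.
sum_by_key() {
    sort -k1,1 "$1" | {
        prev=
        total=0
        while read -r key val; do
            # the key changed: emit the finished group and start a new total
            if [ -n "$prev" ] && [ "$key" != "$prev" ]; then
                echo "$prev $total"
                total=0
            fi
            prev=$key
            total=$((total + val))
        done
        # emit the last group, if there was any input at all
        if [ -n "$prev" ]; then
            echo "$prev $total"
        fi
    } | sort -k2,2nr
}
```

For the sample file this prints c 19, a 18 and b 9 on separate lines.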

Sort the tab-delimited numbers on each line of a file

I'm trying to sort the numbers on each line of a file individually. The numbers within one line are separated by tabs. (I used spaces but they're actually tabs.)
For example, for the following input
5 8 7 6
1 5 6 8
8 9 7 1
the desired output would be:
5 6 7 8
1 5 6 8
1 7 8 9
My attempt so far is:
let i=1
while read line
do
echo "$line" | tr " " "\n" | sort -g
cut -f $i fileName | paste -s >> tempFile$$
((++i))
done < fileName
This is the best I got - I'm sure it can be done in 6 characters with awk/sed/perl:
while read line
do
echo $(printf "%d\n" $line | sort -n) | tr ' ' \\t >> another-file.txt
done < my-input-file.txt
Using a few features that are specific to GNU awk:
$ awk 'BEGIN{ PROCINFO["sorted_in"] = "@ind_num_asc" }
{ delete(a); n = 0; for (i=1;i<=NF;++i) a[$i];
for (i in a) printf "%s%s", i, (++n<NF?FS:RS) }' file
5 6 7 8
1 5 6 8
1 7 8 9
Each field is set as a key in the array a. In GNU awk it is possible to specify the order in which the for (i in a) loop traverses the array - here, I've set it to do so in ascending numerical order.
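Because each field is stored as an array key, a value that appears twice on one line is kept only once. Where duplicates matter, one hedged alternative is to let sort do the work line by line. The sketch below assumes tab-separated numeric fields; the function name sort_fields is made up for illustration:

```shell
# Hypothetical helper: sort the tab-separated numbers on each line of a file.
sort_fields() {
    # "|| [ -n "$line" ]" also handles a last line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line" |
            tr '\t' '\n' |            # one field per line
            sort -n |                 # numeric sort, duplicates are kept
            paste -s -d '\t' -        # join the sorted fields back with tabs
    done < "$1"
}
```

It starts three processes per input line, so it is a readable option for small files rather than a fast one.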
Here is a bash script that can do it. It takes a filename argument or reads stdin, was tested on CentOS and assumes IFS=$' \t\n'.
#!/bin/bash
if [ "$1" ] ; then exec < "$1" ; fi
cat - | while read line
do
set $line
echo $(for var in "$@"; do echo $var; done | sort -n) | tr " " "\t"
done
If you want to put the output in another file run it as:
cat input_file | sorting_script > another_file
or
sorting_script input_file > another_file
Consider using perl for this:
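perl -ape '@F=sort @F;$_="@F\n"' input.txt
Here -a turns on automatic field splitting (like awk does) into the array @F, -p makes it execute the script for each line and print $_ each time, and -e specifies the script directly on the command line. Note that sort compares as strings by default; for multi-digit numbers use sort { $a <=> $b } @F, and "@F" joins the fields with spaces rather than tabs.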
Not quite 6 characters, I'm afraid, Sean.
This should have been simple in awk, but it doesn't quite have the features needed. If there had been an array $@ corresponding to the fields $1, $2, etc., then the solution would have been awk '{asort $@}' input.txt, but sadly no such array exists. The loops required to move the fields into an array and out of it again make it longer than the bash version:
awk '{for(i=1;i<=NF;i++)a[i]=$i;asort(a);for(i=1;i<=NF;i++)printf("%s ",a[i]);printf("\n")}' input.txt
So awk isn't the right tool for the job here. It's also a bit odd that sort itself doesn't have a switch to sort the fields within each line instead of the lines themselves.
Using awk
$ cat file
5 8 7 6
1 5 6 8
8 9 7 1
$ awk '{c=1;while(c!=""){c=""; for(i=1;i<NF;i++){n=i+1; if($i>$n){c=$i;$i=$n;$n=c}}}}1' file
5 6 7 8
1 5 6 8
1 7 8 9
Better Readable version
awk '{
c=1
while(c!="")
{
c=""
for(i=1;i<NF;i++)
{
n=i+1
if($i>$n)
{
c=$i
$i=$n
$n=c
}
}
}
}1
' file
If you have ksh, you may try this
#!/usr/bin/env ksh
while read line ; do
set -s +A cols $line
echo ${cols[*]}
done < "input_file"
Test
[akshay@localhost tmp]$ cat test.ksh
#!/usr/bin/env ksh
cat <<EOF | while read line ; do set -s +A cols $line; echo ${cols[*]};done
5 8 7 6
1 5 6 8
8 9 7 1
EOF
[akshay@localhost tmp]$ ksh test.ksh
5 6 7 8
1 5 6 8
1 7 8 9

select the second line to last line of a file

How can I select the lines from the second line to the line before the last line of a file by using head and tail in unix?
For example if my file has 15 lines I want to select lines from 2 to 14.
tail -n +2 /path/to/file | head -n -1
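head -n -1 (a negative line count) is a GNU extension and is not available in every head. If sed is acceptable instead of head and tail, a portable sketch is to delete the first and the last line (the function name strip_first_last is only illustrative):

```shell
# Hypothetical wrapper: print everything except the first and the last line.
# 1d deletes line 1, $d deletes the last line; all other lines pass through.
strip_first_last() {
    sed '1d;$d' "$@"
}
```

For a 15-line file, strip_first_last /path/to/file prints lines 2 to 14.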
perl -ne 'print if($.!=1 and !(eof))' your_file
tested below:
> cat temp
1
2
3
4
5
6
7
> perl -ne 'print if($.!=1 and !(eof))' temp
2
3
4
5
6
>
alternatively in awk you can use below:
awk '{a[count++]=$0}END{for(i=1;i<count-1;i++) print a[i]}' your_file
To print all lines but first and last ones you can use this awk as well:
awk 'NR==1 {next} {if (f) print f; f=$0}'
This always prints the previous line. To prevent the first one from being printed, we skip the line when NR is 1. Then, the last one won't be printed because when reading it we are printing the penultimate!
Test
$ seq 10 | awk 'NR==1 {next} {if (f) print f; f=$0}'
2
3
4
5
6
7
8
9

Resources