Linux: I want to fetch the top 10 records of column 3. The column has some missing data. I have sorted the file

I am not sure how to ignore the missing data here.
My ;-separated file looks like this (written with spaces so that it is readable):
Col1 Col2 Col3 Col4 Col5
12 a ? ? ?
1 b ? ? ?
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
and so on....
I want to fetch the records with the maximum values of column 3:
Col1 Col2 Col3 Col4 Col5
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
I sorted the file on column3 in reverse order. Not sure how to proceed further.
sort -t';' -k3 -r original.txt > newfile.txt

Something great about the command line is that you can easily use the best tool for each job, either chaining output with pipes (|) or creating temporary files like newfile.txt.
In this case, sort is the right choice for sorting your data. Once it's sorted, you can hand the result to a separate tool that's very efficient at parsing fields: awk.
Starting from your sorted newfile.txt, this awk command will only print lines with 5 fields (assuming your missing data is actually missing and there are no empty separators, e.g. your line looks like 45;c; rather than 45;c;;;):
awk -F';' 'NF == 5 { print }' newfile.txt
However, in the case that the empty fields are delimited (e.g. 45;c;;;), and assuming that only columns 3 through 5 may have missing data, this will handle it:
awk -F';' 'NF == 5 && $3 && $4 && $5 { print }' newfile.txt
Note that since awk's default action for a matching pattern is to print the line, the { print } above is actually unnecessary, but it is included for clarity.
Thus, from start to finish, you can get your desired result with:
sort -t ';' -rk3 original.txt | awk -F';' 'NF==5 && $3 && $4 && $5' > result.txt
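To also cap the result at the 10 largest values, as the question title asks, you could append head (a sketch; -n is added so that column 3 is compared numerically rather than lexically):
sort -t ';' -rnk3 original.txt | awk -F';' 'NF==5 && $3 && $4 && $5' | head -n 10 > result.txt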

You can use the following command:
$ head -1 fileToSort; (sort -k3 -n -r <(tail -n +2 fileToSort) | head)
Col1 Col2 Col3 Col4 Col5
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
1 b ? ? ?
12 a ? ? ?
where fileToSort is
cat fileToSort
Col1 Col2 Col3 Col4 Col5
12 a ? ? ?
1 b ? ? ?
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
Explanations:
Use -t';' if your field separator is ;
<(tail -n +2 fileToSort) will exclude the header of the input file
You then sort it via the 3rd key in reverse and numeric mode using -n
head will limit the output to the first 10 records
head -1 fileToSort; will print the header line before printing the top 10 records
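If the file really is ;-separated as in the question, the same pipeline would look like this (a sketch, just adding the field separator):
head -1 fileToSort; sort -t';' -k3 -n -r <(tail -n +2 fileToSort) | head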
If you need an awk solution:
awk 'NR==1;
     NF == 5 && $3~/^[0-9]+(\.[0-9]+)?$/ && $4~/^[0-9]+(\.[0-9]+)?$/ && $5~/^[0-9]+(\.[0-9]+)?$/ {buff[$3]=$0}
     END{n=asorti(buff,out); for (i = n; i >= 1; i--){print buff[out[i]]}}' fileToSort
Col1 Col2 Col3 Col4 Col5
45 c 7.22 6.09 2.2
11 d 7.0 3.89 9.7
26 e 6.24 8.2 5.9
You might need to add -F';' right after the awk command if your file uses ; instead of spaces as the separator; the command then becomes awk -F';' ...
NR==1; prints the first line (the header)
NF == 5 && $3~/^[0-9]+(\.[0-9]+)?$/ && $4~/^[0-9]+(\.[0-9]+)?$/ && $5~/^[0-9]+(\.[0-9]+)?$/ checks that you have 5 fields and that the values of the last 3 columns are numeric
{buff[$3]=$0} saves each line in a buffer indexed by the column 3 value
END{n=asorti(buff,out); for (i = n; i >= 1; i--){print buff[out[i]]}} at the end of the processing, orders the array by index value and prints it in reverse order. Note that asorti is a GNU awk (gawk) extension.
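To cap the output at the 10 largest column 3 values, a small tweak to the END loop is enough (a sketch; it simply stops after 10 prints):
awk 'NR==1;
     NF == 5 && $3~/^[0-9]+(\.[0-9]+)?$/ && $4~/^[0-9]+(\.[0-9]+)?$/ && $5~/^[0-9]+(\.[0-9]+)?$/ {buff[$3]=$0}
     END{n=asorti(buff,out); for (i = n; i >= 1 && n - i < 10; i--) print buff[out[i]]}' fileToSort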

Related

Bash: Reading CSV text file and finding average of rows

This is the sample input (the data has user-IDs and the number of hours spent by the user):
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
I need to read the data, find all User-IDs ending in even numbers (2,4,6,8..) and find average number of hours spent (over five days).
I wrote the following script:
hoursarray=(0,0,0,0,0)
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
    if [[ $col2 == *"2" ]]; then
        #echo "$col2"
        ((hoursarray[0] = col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"4" ]]; then
        #echo "$col2"
        ((hoursarray[1] = hoursarray[1] + col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"6" ]]; then
        #echo "$col2"
        ((hoursarray[2] = hoursarray[2] + col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"8" ]]; then
        #echo "$col2"
        ((hoursarray[3] = hoursarray[3] + col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"10" ]]; then
        #echo "$col2"
        ((hoursarray[4] = hoursarray[4] + col3 + col4 + col5 + col6 + col7))
    fi
done < <(tail -n+2 user-list.txt)
echo ${hoursarray[0]}
echo "$((hoursarray[0]/5))"
This is not a very good way of doing this. Also, the numbers aren't adding up correctly.
I am getting the following output (for the first one - user2):
27
5
I am expecting the following output:
27
5.4
What would be a better way to do it? Any help would be appreciated.
TIA
Your description is fairly imprecise, but here's an attempt primarily based on the sample output:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};print a;printf "%.2g\n",a/5; a=0}' file
20
4
27
5.4
$2~/[24680]$/ makes sure we only look at "even" user-IDs.
for(i=3;i<=7;i++){} iterates over the day columns and adds them.
Edit 1:
Accommodating new requirement:
awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' saad
User4 4
User2 5.4
Sample data showing userIDs with even and odd endings, userID showing up more than once (eg, User2), and some non-integer values:
$ cat user-list.txt
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer5,User120,9,8,10,0,0
Computer5,User2,4,7,12,3.5,1.5
One awk solution to find total hours plus averages across the 5 days, with duplicate userIDs rolled into a single set of numbers, but limited to userIDs that end in an even number:
$ awk -F',' 'FNR==1 { next } $2 ~ /[02468]$/ { tot[$2]+=($3+$4+$5+$6+$7) } END { for ( i in tot ) { print i, tot[i], tot[i]/5 } }' user-list.txt
Where:
-F ',' - use comma as input field delimiter
FNR==1 { next } - skip first line
$2 ~ /[02468]$/ - if field 2 ends in an even number
tot[$2]+=($3+$4+$5+$6+$7) - add current line's hours to array where userID is the array index; this will add up hours from multiple input lines (for same userID) into a single array cell
for (...) { print ...} - loop through array indices printing the index, total hours and average hours (total divided by 5)
The above generates:
User120 27 5.4
User2 55 11
User4 20 4
Depending on the OP's desired output, the print can be replaced with printf and the desired format string.
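For example, a tab-separated version with one decimal place for the average might look like this (a sketch of one possible format string):
awk -F',' 'FNR==1 { next } $2 ~ /[02468]$/ { tot[$2]+=($3+$4+$5+$6+$7) } END { for ( i in tot ) printf "%s\t%g\t%.1f\n", i, tot[i], tot[i]/5 }' user-list.txt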
Your issue is echo "$((hoursarray[0]/5))". Bash does not do floating-point arithmetic, so it returns only the integer portion.
Easy to demonstrate:
$ hours=27
$ echo "$((hours/5))"
5
If you want to stick to Bash, you could use bc for the floating point result:
$ echo "$hours / 5.0" | bc -l
5.40000000000000000000
Or use awk, perl, python, ruby etc.
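For instance, the same division in awk (a sketch):
$ awk -v h="$hours" 'BEGIN { printf "%.1f\n", h / 5 }'
5.4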
Here is an awk you can parse out, easily modified to your use (which is a little unclear to me):
awk -F, 'FNR==1{print $2; next}
{arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User1 27 5.4
User2 27 5.4
User3 22 4.4
User4 20 4
User5 40 8
If you only want even users, filter for user IDs that end in any of 0, 2, 4, 6, 8:
awk -F, 'FNR==1{print $2; next}
$2~/[24680]$/ {arr[$2]+=($3+$4+$5+$6+$7) }
END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file
Prints:
User ID
User2 27 5.4
User4 20 4
Here is your script modified a little bit:
while IFS=, read -r col1 col2 col3 || [[ -n $col1 ]]
do
(( $(sed 's/[^[:digit:]]*//' <<<$col2) % 2 )) || ( echo -n "For $col1 $col2 average is: " && echo "($(tr , + <<<$col3))/5" | bc -l )
done < <(tail -n+2 list.txt)
prints:
For Computer3 User4 average is: 4.00000000000000000000
For Computer5 User2 average is: 5.40000000000000000000

Awk: Sum up column values across multiple files with identical column layout

I have a number of files with the same header:
COL1, COL2, COL3, COL4
You can ignore COL1-COL3. COL4 contains a number. Each file contains about 200 rows. I am trying to sum COL4 across the files, row by row. For example:
File 1
COL1 COL2 COL3 COL4
x y z 3
a b c 4
File 2
COL1 COL2 COL3 COL4
x y z 5
a b c 10
Then a new file is returned:
COL1 COL2 COL3 COL4
x y z 8
a b c 14
Is there a simple way to do this without AWK? I will use AWK if need be, I just thought there might be a simple one-liner that I could just run right away. The AWK script I have in mind feels a bit long.
Thanks
Combining paste with awk, as in Kristo Mägi's answer, is your best bet:
paste merges the corresponding lines from the input files,
which sends a single stream of input lines to awk, with each input line containing all fields to sum up.
Assuming a fixed number of input files and columns, Kristo's answer can be simplified to (making processing much more efficient):
paste file1 file2 | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'
Note: The above produces space-separated output columns, because awk's default value for OFS, the output-field separator, is a single space.
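If you prefer tab-separated output, OFS can simply be overridden on the command line (a sketch):
paste file1 file2 | awk -v OFS='\t' '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'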
Assuming that all files have the same column structure and line count, below is a generalization of the solution, which:
generalizes to more than 2 input files (and more than 2 data rows)
generalizes to any number of fields, as long as the field to sum up is the last one.
#!/bin/bash
files=( file1 file2 ) # array of input files
paste "${files[#]}" | awk -v numFiles=${#files[#]} -v OFS='\t' '
{
row = sep = ""
for(i=1; i < NF/numFiles; ++i) { row = row sep $i; sep = OFS }
sum = $(NF/numFiles) # last header col. / (1st) data col. to sum
if (NR > 1) { for(i=2; i<=numFiles; ++i) sum += $(NF/numFiles * i) } # add other cols.
printf "%s%s%s\n", row, OFS, sum
}
'
Note that \t (the tab char.) is used to separate output fields and that, due to relying on awk's default line-splitting into fields, preserving the exact input whitespace between fields is not guaranteed.
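For reference, with file1 and file2 from the question, the generalized script above should produce tab-separated output along these lines:
COL1	COL2	COL3	COL4
x	y	z	8
a	b	c	14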
If all files have the same header - awk solution:
awk '!f && FNR==1{ f=1; print $0 }FNR>1{ s[FNR]+=$NF; $NF=""; r[FNR]=$0 }
END{ for(i=2;i<=FNR;i++) print r[i],s[i] }' File[12]
The output (for 2 files):
COL1 COL2 COL3 COL4
x y z 8
a b c 14
This approach can be applied to multiple files (in that case you may use the glob File* for filename expansion).
One more option.
The command:
paste f{1,2}.txt | sed '1d' | awk '{print $1,$2,$3,$4+$8}' | awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
The result:
COL1 COL2 COL3 COL4
x y z 8
a b c 14
What it does:
Test files:
$ cat f1.txt
COL1 COL2 COL3 COL4
x y z 3
a b c 4
$ cat f2.txt
COL1 COL2 COL3 COL4
x y z 5
a b c 10
Command: paste f{1,2}.txt
Joins 2 files and gives output:
COL1 COL2 COL3 COL4 COL1 COL2 COL3 COL4
x y z 3 x y z 5
a b c 4 a b c 10
Command: sed '1d'
Is meant to remove header temporarily
Command: awk '{print $1,$2,$3,$4+$8}'
Returns COL1-3 and sums $4 and $8 from paste result.
Command: awk 'BEGIN{print "COL1","COL2","COL3","COL4"}1'
Adds header back
EDIT:
Following @mklement0's comment, he is right about the header handling, as I forgot the NR==1 part.
So, I'll mirror his updated version here as well:
paste f{1,2}.txt | awk '{ print $1, $2, $3, (NR==1 ? $4 : $4 + $8) }'
You state you have "a number of files", i.e., more than 2.
Given these 3 files (and should work with any number):
$ cat f1 f2 f3
COL1 COL2 COL3 COL4
x y z 3
a b c 4
COL1 COL2 COL3 COL4
x y z 5
a b c 10
COL1 COL2 COL3 COL4
x y z 10
a b c 15
You can do:
$ awk 'FNR==1{next}
{sum[$1]+=$4}
END{print "COL1 COL4";
for (e in sum) print e, sum[e]} ' f1 f2 f3
COL1 COL4
x 18
a 29
It is unclear what you intend to do with COL2 or COL3, so I did not add that.
$ awk '
NR==1 { print }
{ sum[FNR]+=$NF; sub(/[^[:space:]]+[[:space:]]*$/,""); pfx[FNR]=$0 }
END { for(i=2;i<=FNR;i++) print pfx[i] sum[i] }
' file1 file2
COL1 COL2 COL3 COL4
x y z 8
a b c 14
The above will work robustly and efficiently with any awk on any UNIX system, with any number of input files and with any contents of those files. The only potential problem would be that it has to retain the equivalent of 1 of those files in memory so if each file was absolutely massive then you may exhaust available memory.

awk difference between subsequent lines

This is a great example of how to solve the problem when I want to print differences between subsequent lines of a single column:
awk 'NR>1{print $1-p} {p=$1}' file
But how would I do it if I have multiple columns (an unknown number) in the file and I want the differences for all of them? E.g. (note that the number of columns is not necessarily 3; it can be 10 or 15 or more):
col1 col2 col3
---- ---- ----
1 3 2
2 4 10
1 9 -3
. . .
the output would be:
col1 col2 col3
---- ---- ----
1 1 8
-1 5 -13
. . .
Instead of saving just the first column, save the entire line; you will then be able to split it and print the differences using a loop:
awk 'NR>1{for(i=1;i<=NF;i++) printf "%d ", $i - a[i] ; print ""}
{p=split($0, a)}' file
If you need the column title then you can print it using BEGIN.
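For example, a sketch that hard-codes the two header lines in a BEGIN block (assuming the sample layout above, where the first two lines are the header and the separator row):
awk 'BEGIN { print "col1 col2 col3"; print "---- ---- ----" }
     NR > 3 { for (i = 1; i <= NF; i++) printf "%d%s", $i - a[i], (i < NF ? " " : "\n") }
     { split($0, a) }' file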
$ awk 'NR<3; NR>3{for (i=1;i<=NF;i++) printf "%d%s", $i-p[i], (i<NF?OFS:ORS)} {split($0,p)}' file | column -t
col1 col2 col3
---- ---- ----
1 1 8
-1 5 -13

Select rows in one file based on specific values in the second file (Linux)

I have two files:
One is "total.txt". It has two columns: the first column is natural numbers (indicator) ranging from 1 to 20, the second column contains random numbers.
1 321
1 423
1 2342
1 7542
2 789
2 809
2 5332
2 6762
2 8976
3 42
3 545
... ...
20 432
20 758
The other one is "index.txt". It has three columns (1: indicator, 2: low value, 3: high value):
1 400 5000
2 600 800
11 300 4000
I want to output the rows of "total.txt" whose first column matches the first column of "index.txt". At the same time, the second column of the output rows must be larger than (>) the second column of "index.txt" and smaller than (<) the third column of "index.txt".
The expected result is as follows:
1 423
1 2342
2 809
2 5332
2 6762
11 ...
11 ...
I have tried this:
awk '$1==(awk 'print($1)' index.txt) && $2 > (awk 'print($2)' index.txt) && $1 < (awk 'print($2)' index.txt)' total.txt > result.txt
But it failed!
Can you help me with this? Thank you!
You need to read both files in the same awk script. When you read index.txt, store the other columns in an array.
awk 'FNR == NR { low[$1] = $2; high[$1] = $3; next }
$2 > low[$1] && $2 < high[$1] { print }' index.txt total.txt
FNR == NR is the common awk idiom to detect when you're processing the first file.
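The same idea, written with an explicit membership test on the first column and redirected to result.txt as in the question (a sketch; behavior is the same for the sample data):
awk 'FNR == NR { low[$1] = $2; high[$1] = $3; next }
     ($1 in low) && $2 > low[$1] && $2 < high[$1]' index.txt total.txt > result.txt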
Use join like Barmar said:
# To join on the first columns
join -11 -21 total.txt index.txt
And if the files aren't sorted in lexical order by the first column then:
join -11 -21 <(sort -k1,1 total.txt) <(sort -k1,1 index.txt)
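Note that join by itself only matches on the first column; to also apply the range check, the joined lines (indicator, value, low, high) can be piped through awk (a sketch, assuming join's default space-separated output):
join -1 1 -2 1 <(sort -k1,1 total.txt) <(sort -k1,1 index.txt) | awk '$2 > $3 && $2 < $4 { print $1, $2 }'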

Using AWK to add a column to a TSV (tab-separated file)

I want to add a column to the beginning of a tab-delimited file using awk, so a line like
col1 col2 col3
would end up like
345 col1 col2 col3
So far I have this
awk '{FS=" "; OFS=" "; print '345' $0;}' file.tsv > output.tsv
but I end up with
345col1 col2 col3
Where am I going wrong?
Thanks
You need a comma after the '345'.
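For example, assuming tab-separated output is wanted, the corrected command could look like this (a sketch):
awk 'BEGIN { OFS = "\t" } { print 345, $0 }' file.tsv > output.tsv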
You can simply do:
awk '{print "345\t"$0}' file.tsv > output.tsv
This awk should work:
awk '{FS=OFS="\t"; print '345', $0}' file
EDIT: Output with od on OSX to prove it works there:
$> awk '{FS=OFS="\t"; print '345', $0}' file|od -c
0000000 3 4 5 \t c o l 1 \t c o l 2 \t c o
0000020 l 3 \n
0000023
$> uname -a
Darwin US143639.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun 7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386 i386
This might work for you (GNU sed):
sed 's/^/345\t/' file
