Bash Script - Divide Column 2 by Column 3 in the middle but keep 1 and 4 on either side - linux

I have a list that has an ID, population, area and province, that looks like this:
1:517000:405212:Newfoundland and Labrador
2:137900:5660:Prince Edward Island
3:751400:72908:New Brunswick
4:938134:55284:Nova Scotia
5:7560592:1542056:Quebec
6:12439755:1076359:Ontario
7:1170300:647797:Manitoba
8:996194:651036:Saskatchewan
9:3183312:661848:Alberta
10:4168123:944735:British Columbia
11:42800:1346106:Northwest Territories
12:31200:482443:Yukon Territories
13:29300:2093190:Nunavut
I need to display the names of the provinces with the lowest and highest population density (population/area). How can I divide column 2 by column 3 (to 2 decimal places) but keep the file information intact on either side (e.g. 1:1.28:Newfoundland and Labrador)? After that I figure I can just pump it into sort -t: -nk2 | head -n 1 and sort -t: -nrk2 | head -n 1 to pull them out.
The recommended command given was grep.

Since you seem to have the sorting and extraction under control, here's an example awk script that should work for you:
#!/usr/bin/env awk -f
BEGIN {
    FS = ":"       # fields in the input are separated by ':'
    OFS = ":"      # keep ':' as the separator in the output
    OFMT = "%.2f"  # print computed numbers with 2 decimal places
}
{
    # ID, density (population / area), province name
    print $1, $2/$3, $4
}

Related

Using awk and sort last column in descending order in Linux

I have a file that contains both names and numbers, like:
data.csv
2016,Bmw,M2,2 Total score:24
1998,Subaru,Legacy,23 Total score:62
2012,Volkswagen,Golf,59 Total score:28
2001,Dodge,Viper,42 Total score:8
2014,Honda,Accord,83 Total score:112
2015,Chevy,Camaro,0 Total score:0
2008,Honda,Accord,88 Total score:48
Total score is the last column. I did:
awk -F"," 'NR>1{{for(i=4;i<=6;++i)printf $i""FS }
{sum=0; for(g=8;g<=NF;g++)
sum+=$g
print $i,"Total score:"sum ; print ""}}' data.csv
When I try
awk -F"," 'NR>1{{for(i=4;i<=6;++i)printf $i""FS }
{sum=0; for(g=8;g<=NF;g++)
sum+=$g
print $i,"Total score:"sum ; print "" | "sort -k1,2n"}}' data.csv
It gave me an error. I only want to sort the total score column. Is there anything I did wrong? Any help is appreciated.
First, assuming there are really no blank lines in between each line of data in data.csv, all you need is sort; you don't need awk at all. For example, since the only ':' in each line comes just before the total score you want to sort descending by, you can use:
sort -t: -k2,2rn data.csv
Where -t: tells sort to use ':' as the field separator, and the KEYDEF -k2,2rn tells sort to use the 2nd field (what comes after the ':') as the sort key, with rn requesting a reverse numeric sort on that field.
Example Use/Output
With your data (without blank lines) in data.csv, you would have:
$ sort -t: -k2,2rn data.csv
2014,Honda,Accord,83 Total score:112
1998,Subaru,Legacy,23 Total score:62
2008,Honda,Accord,88 Total score:48
2012,Volkswagen,Golf,59 Total score:28
2016,Bmw,M2,2 Total score:24
2001,Dodge,Viper,42 Total score:8
2015,Chevy,Camaro,0 Total score:0
Which is your sort by Total score in descending order. If you want ascending order, just remove the r from -k2,2rn.
If you do have blank lines, you can remove them before the sort with sed -i '/^$/d' data.csv.
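If you'd rather not modify data.csv in place, the blank-line removal and the sort can also be combined in a single pipeline (this just chains the two commands already shown):
sed '/^$/d' data.csv | sort -t: -k2,2rn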
Sorting by the number before "Total score"
If you want to sort by the number that begins the XX Total score: yy field (e.g. XX), you can use sort with the field separator being a ',' and then your KEYDEF would be -k4.1,4.3rn which just says sort using the 4th field character 1 through character 3 by the same reverse numeric, e.g.
sort -t, -k4.1,4.3rn data.csv
Example Use/Output
In this case, sorting by the number before Total score in descending order would result in:
$ sort -t, -k4.1,4.3rn data.csv
2008,Honda,Accord,88 Total score:48
2014,Honda,Accord,83 Total score:112
2012,Volkswagen,Golf,59 Total score:28
2001,Dodge,Viper,42 Total score:8
1998,Subaru,Legacy,23 Total score:62
2016,Bmw,M2,2 Total score:24
2015,Chevy,Camaro,0 Total score:0
After posting the original solution I noticed it was ambiguous as to which of the numbers in the 4th field you intended to sort on. In either case, both solutions are shown above. Let me know if you have further questions.

Filtering on a condition using the column names and not numbers

I am trying to filter a text file with columns based on two conditions. Due to the size of the file, I cannot use the column numbers (there are thousands of columns and they are unnumbered) but need to use the column names. I have searched and tried to come up with multiple ways to do this, but nothing is returned to the command line.
Here are a few things I have tried:
awk '($colname1==2 && $colname2==1) { count++ } END { print count }' file.txt
to filter out the columns based on their conditions
and
head -1 file.txt | tr '\t' | cat -n | grep "COLNAME
to try and return the possible column number related to the column.
An example file would be:
ID ad bd
1 a fire
2 b air
3 c water
4 c water
5 d water
6 c earth
Output would be:
2 (count of ad=c and bd=water)
With your input file and the implied conditions this should work:
$ awk -v c1='ad' -v c2='bd' 'NR==1{n=split($0,h); for(i=1;i<=n;i++) col[h[i]]=i}
$col[c1]=="c" && $col[c2]=="water"{count++} END{print count+0}' file
2
Or you can replace c1 and c2 with the values in the script as well.
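For example, a minimal sketch of that variant, with the column names and values written directly into the script rather than passed with -v (the logic is otherwise identical):
$ awk 'NR==1{n=split($0,h); for(i=1;i<=n;i++) col[h[i]]=i}
$col["ad"]=="c" && $col["bd"]=="water"{count++} END{print count+0}' file
2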
To find the column indices you can run:
$ awk -v cols='ad bd' 'BEGIN{n=split(cols,c); for(i=1;i<=n;i++) colmap[c[i]]}
NR==1{for(i=1;i<=NF;i++) if($i in colmap) print $i,i; exit}' file
ad 2
bd 3
Or perhaps with this chain:
$ sed 1q file | tr -s ' ' \\n | nl | grep -E 'ad|bd'
2 ad
3 bd
although it may have false positives due to the regex match...
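One way to cut down on those false positives, assuming your grep supports -w (GNU and BSD grep both do), is to require whole-word matches:
$ sed 1q file | tr -s ' ' \\n | nl | grep -wE 'ad|bd'
2 ad
3 bd
so a column named, say, "head" would no longer match the pattern "ad".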
You can rewrite the awk to be more succinct
$ awk -v cols='ad bd' '{while(++i<=NF) if(FS cols FS ~ FS $i FS) print $i,i;
exit}' file
ad 2
bd 3
As I mentioned in an earlier comment, the answer at https://unix.stackexchange.com/a/359699/133219 shows how to do this:
awk -F'\t' '
NR==1 {
for (i=1; i<=NF; i++) {
f[$i] = i
}
}
($(f["ad"]) == "c") && ($(f["bd"]) == "water") { cnt++ }
END { print cnt+0 }
' file
2
I'm assuming your input is tab-separated because of the tr '\t' in the command in your question, which looks like an attempt to convert tabs to newlines so you can map column names to numbers. If I'm wrong and the fields are just separated by chains of white space, then remove -F'\t' from the above.
Use the miller toolkit to manipulate tab-delimited files using column names. Below is a one-liner that filters a tab-delimited file (the delimiter is specified using --tsv) and writes the results to STDOUT together with the header. The header is then removed using tail and the lines are counted with wc.
mlr --tsv filter '$ad == "c" && $bd == "water"' file.txt | tail -n +2 | wc -l
Prints:
2
SEE ALSO:
miller manual
Note that miller can be easily installed, for example, using conda, like so:
conda create --name miller miller
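If your miller build includes the count verb (treat that as an assumption about your installed version), the tail and wc steps can likely be dropped by chaining verbs with then:
mlr --tsv filter '$ad == "c" && $bd == "water"' then count file.txt
which emits a single count record instead of the matching rows.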
For years it bugged me that there is no succinct way in Unix to do this sort of thing, although miller is a pretty good tool for it. Recently I wrote pick to choose columns by name, and additionally to modify, combine and add them by name, as well as to filter rows by clauses using column names. The solution to the above with pick is:
pick -h #ad=c #bd=water < data.txt | wc -l
By default pick prints the header of the selected columns; -h omits it. To print columns you simply name them on the command line, e.g.
pick ad water < data.txt | wc -l
Pick has many modes, all of them focused on manipulating columns and selecting/filtering rows with a minimal amount of syntax.

Grep logs for occurrences per second

I am trying to search logs over a range of time, looking for the number of occurrences of a specific account. For instance I am running this now:
sed '/23:50:28/,/23:55:02/!d' log.log | grep account_number | wc -l
Which nicely returns the total number of times this account has entries within the given time frame. My question is: how can I also get a breakdown of those occurrences by each time entry? Example:
23:50:28 - 2
23:50:29 - 1
23:50:30 - 3
etc.
etc.
Thanks
awk to the rescue!
awk '/23:50:28/,/23:55:02/{if(/account_number/) a[$1]++}
END{for(k in a) print k " - " a[k]}' log | sort
Obviously not tested since there is no sample input.
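As a quick sanity check, here is what it does on a few made-up log lines (the line format and the name sample.log are purely hypothetical; the script assumes the timestamp is the first whitespace-separated field):
$ cat sample.log
23:50:28 login account_number=1234
23:50:28 api_call account_number=1234
23:50:29 logout account_number=1234
23:50:30 login account_number=1234
$ awk '/23:50:28/,/23:55:02/{if(/account_number/) a[$1]++}
END{for(k in a) print k " - " a[k]}' sample.log | sort
23:50:28 - 2
23:50:29 - 1
23:50:30 - 1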

What is the meaning of the delimiter in cut, and why is this command sorting twice?

I am trying to understand this command. As I only know the very basics, here is what I found:
last | cut -d" " -f 1 | sort | uniq -c | sort
last = Last searches back through the file /var/log/wtmp (or the file designated by the -f flag) and displays a list of all users logged in (and out) since that file was created.
cut is to show the desired column.
The option -d specifies what is the field delimiter that is used in the input file.
-f specifies which field you want to extract
1 is the output, I think, but I am not sure.
And then it is sorting, and then:
The uniq command is helpful to remove or detect duplicate entries in a file. This tutorial explains a few of the most frequently used uniq command line options that you might find helpful.
If anyone can explain this command and also explain why there is two sorts I will appreciate it.
You are right in your explanation of cut: cut -d" " -f1 (no space is needed after -f) gets the first field of a stream based on the delimiter " " (space).
Then why sort | uniq -c | sort?
From man uniq:
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.
That's why you need to sort the lines before piping them to uniq. Finally, as the output of uniq -c is not ordered by count, you need to sort again to see the most repeated items first.
See an example of sort and uniq -c for a given file with repeated items:
$ seq 5 >>a
$ seq 5 >>a
$ cat a
1
2
3
4
5
1
2
3
4
5
$ sort a | uniq -c | sort <--- duplicates are adjacent, so they are collapsed and counted
2 1
2 2
2 3
2 4
2 5
$ uniq -c a | sort <---- duplicates are not adjacent, so they are not collapsed
1 1
1 1
1 2
1 2
1 3
1 3
1 4
1 4
1 5
1 5
Note you can do the sort | uniq -c all together with this awk:
last | awk '{a[$1]++} END{for (i in a) print i, a[i]}'
This will store the values of the first column as keys of the a[] array and increment the counter whenever it sees the same value again. In the END{} block it prints the results, unsorted, so you could pipe the output to sort again.
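For example, to get the same count-sorted view as the original pipeline (the -k2,2nr key sorts on the count awk prints in the second column; drop the r for ascending order):
last | awk '{a[$1]++} END{for (i in a) print i, a[i]}' | sort -k2,2nr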
uniq -c is being used to create a frequency histogram. The reason for the second sort is that you are then sorting your histogram by frequency order.
The reason for the first sort is that uniq is only comparing each line to its previous when deciding whether the line is unique or not.

Bash- sum values from an array in one line

I have this array:
array=(1 2 3 4 4 3 4 3)
I can get the largest number with:
echo "num: $(printf "%d\n" ${array[#]} | sort -nr | head -n 1)"
#outputs 4
But I want to get all the 4's and sum them up, meaning I want it to output 12 (there are 3 occurrences of 4) instead. Any ideas?
dc <<<"$(printf '%d\n' "${array[#]}" | sort -n | uniq -c | tail -n 1) * p"
sort to get max value at end
uniq -c to get only unique values, with a count of how many times they appear
tail to get only the last line (with the max value and its count)
dc to multiply the value by the count
I picked dc for the multiplication step because it's RPN, so you don't have to split up the uniq -c output and insert anything in the middle of it - just add stuff to the end.
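To make the RPN step concrete, the intermediate output for this array looks roughly like this (the exact whitespace printed by uniq -c may differ):
$ printf '%d\n' "${array[@]}" | sort -n | uniq -c | tail -n 1
      3 4
so dc ends up evaluating "3 4 * p": push 3, push 4, multiply, print, which gives 12.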
Using awk:
$ printf "%d\n" "${array[#]}" | sort -nr | awk 'NR>1 && p!=$0{print x;exit;}{x+=$0;p=$0;}'
12
Using sort, the numbers are sorted (-n) in reverse (-r) order, and the awk script keeps summing the numbers until it finds one that differs from the previous one.
You can do this with awk:
awk -v RS=" " '{sum[$0]+=$0; if($0>max) max=$0} END{print sum[max]}' <<<"${array[#]}"
Setting RS (record separator) to space allows you to read your array entries as separate records.
sum[$0]+=$0; means sum is a map of cumulative sums for each input value; if($0>max) max=$0 tracks the largest number seen so far; END{print sum[max]} prints the sum for the largest number seen at the end.
<<<"${array[#]}" is a here-document that allows you to feed a string (in this case all elements of the array) as stdin into awk.
This way there is no piping or looping involved - a single command does all the work.
Using only bash:
arr="${array[*]}"        # join all elements into one space-separated string
echo $(( ${arr// /+} ))  # prints 24
Replace all spaces with plus signs and evaluate the result with an arithmetic expansion. Note that ${array// /+} applied directly to an array only operates on its first element, which is why the elements are joined into a plain string first, and that this sums every element of the array (24 here) rather than just the occurrences of the maximum.
