From a directory listing, I have created an output that presents file size in column 1 and a section of the filename (it's a date) in column 2.
178694671 2017-10-14
175332227 2017-10-14
175021608 2017-10-14
174851281 2017-10-14
175316643 2017-10-14
What I now need to do is a group-by, sum and count on this list: group and count the files by column 2, and sum the file sizes associated with each group.
The result of the above output would look like this:
879216430 2017-10-14 5
I tried this
awk '{sum[$1]+= $2;}END{for (date in sum){print sum[date], date;}}'
But it provides strange results and I don't really understand what it's doing.
Can anyone help?
Use another associative array to store the frequency of each date, as in:
awk '{ ++freq[$2]                # count files per date (column 2)
       sum[$2] += $1 }           # sum file sizes (column 1) per date
     END{for (date in sum) print sum[date], date, freq[date]}' file
879216430 2017-10-14 5
Also note that the key of your array should be $2, i.e. the date, not $1.
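As for why your original attempt produces strange results: with sum[$1] += $2 the array is keyed on the file size, and the date string is added numerically. awk converts "2017-10-14" to the number 2017 (conversion stops at the first character that cannot be part of a number), so on the sample above you should get something like this, one line per distinct size, in no guaranteed order:
awk '{sum[$1]+= $2;}END{for (date in sum){print sum[date], date;}}' file
2017 178694671
2017 175332227
2017 175021608
2017 174851281
2017 175316643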
I have VCF statistics for heterozygote and homozygote cases and I would like to find matches with my MAF file. The issue is that the reference field in the MAF file is different and it excludes nucleotides in alternative states, e.g. if you have a ref CAA and the alternative variant is CAAAAA, in the MAF file your ref would be AAA.
So I need code to change the ref and alt fields in my statistics file (maybe adding separate columns ref2 and alt2).
Here is a snippet of my file:
CHR POS ID REF ALT
chr11 71579744 rs71049992 A ACAGCAGCTGGACTGGGAGCAGCAGGACCTG (insertion case)
chr11 124880551 rs71859853 CCGGAGT C (deletion case)
I think I should first count the number of nucleotides in columns 4 and 5. Then, if the number in column 4 is greater than the number in column 5 (meaning a deletion), my ref2 at that position will start from the next nucleotide that differs from the alternative one.
For an insertion, I will have the alt site changed, skipping the ref nucleotides.
As a result, I would like to have this:
CHR POS ID REF ALT REF2 ALT2
chr11 71579744 rs71049992 A ACAGCAGCTGGACTGGGAGCAGCAGGACCTG A CAGCAGCTGGACTGGGAGCAGCAGGACCTG
chr11 124880551 rs71859853 CCGGAGT C CGGAGT C
Thank you very much in advance!
I think I should first count the number of nucleotides in columns 4 and 5…
With awk, you can use the length function to count the number of nucleotides:
awk 'NR==1 {print $0" REF2 ALT2"}                      # assuming first line has column headers
     NR>1 {if (length($4)<length($5))                  # insertion: ALT longer than REF
               print $0, $4, gensub($4, "", 1, $5)     # keep REF, strip the leading REF off ALT
           else                                        # deletion: REF longer than ALT
               print $0, gensub($5, "", 1, $4), $5     # strip the leading ALT off REF, keep ALT
          }' file
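Note that gensub is gawk-specific and treats $4 / $5 as regular expressions. If you prefer to strip the shared leading bases positionally, a substr-based variant should behave the same on your two example rows (a sketch, assuming the shorter allele is always a prefix of the longer one):
awk 'NR==1 { print $0" REF2 ALT2"; next }                # header: add the two new column names
     length($4) < length($5) {                           # insertion: ALT is longer than REF
         print $0, $4, substr($5, length($4)+1); next }  # drop the leading REF bases from ALT
     { print $0, substr($4, length($5)+1), $5 }          # deletion: drop the leading ALT bases from REF
    ' file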
I have a CSV file I am trying to edit to add a numeric ID-type column with unique integers from 1 to approximately 4,000,000. Some of the fields already have an ID value, so I was hoping I could just sort those and then fill in starting at the largest value + 1. However, I cannot open this file to edit in Excel because of its size (I can only see the max of 1,048,000 or whatever rows). Is there an easy way to do this? I am not familiar with coding, so I was hoping there was a way to do it manually that is similar to Excel's fill series feature.
Thanks!
Also: I know there are threads on how to edit a large CSV file, but I was hoping for help with this specific edit. Thanks!
Also: I basically want to sort the rows by idnumber and then add unique IDs to the rows that are missing an ID value.
Screenshot of file
One way, using Notepad++ and a plugin named SQL:
Load the CSV in Notepad++
In the SQL plugin, enter the query SELECT a+1,b,c FROM data
Hit 'start'
When starting with a file like this:
a,b,c
1,2,3
4,5,6
7,8,9
The results after look like this:
SQL Plugin 1.0.1025
Query : select a+1,b,c from data
Sourcefile : abc.csv
Delimiter : ,
Number of hits: 3
===================================================================================
Query result:
2,2,3
5,5,6
8,8,9
Or, in words, the first column is incremented by 1.
A second solution, using gawk, downloaded from https://www.klabaster.com/freeware.htm#mawk:
D:\TEMP>type abc.csv
a,b,c
1,2,3
4,5,6
7,8,9
D:\TEMP>gawk "BEGIN{ FS=OFS=\",\"; getline; print $0 }{ print $1+1,$2,$3 }" abc.csv
a,b,c
2,2,3
5,5,6
8,8,9
(g)awk is a tool which reads a file line by line. Each line is then accessible via $0, and the parts of the line via $1, $2, $3, ... using a separator.
This separator is set in my example (FS=OFS=\",\";) in the BEGIN section, which is executed only once, before any input is read. Do not get confused by the \": because the whole script is between double quotes on the command line, a double quote inside it (like the one used to set OFS) needs to be escaped as \".
The getline; print $0 part takes care of the first line of the CSV, which typically holds the column names.
Then, for every remaining line, the piece of code print $1+1,$2,$3 prints the first column incremented by 1, followed by the second and third columns.
To extend this second example:
gawk "BEGIN{ FS=OFS=\",\"; getline; print $0 }{ print ($1<5?$1+1:$1),$2,$3 }" abc.csv
The ($1<5?$1+1:$1) checks whether the value of $1 is less than 5 ($1<5); if true, it returns $1+1, otherwise $1. In other words, it only adds 1 if the current value is less than 5.
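On abc.csv from above, that should print something like this (only the rows whose first column is below 5 get incremented):
a,b,c
2,2,3
5,5,6
7,8,9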
With your data you end up with something like this (untested!):
gawk "BEGIN{ FS=OFS=\",\"; getline; a=42; print $0 }{ if($4+0==0){ a++ }; print ($4<=0?$a:$1),$2,$3 }" input.csv
a=42 sets the initial value for the column values which need to be updated (you need to change this to the correct starting value).
The if($4+0==0){ a++ } will increment the value of a when the fourth column equals 0 (the $4+0 is done to convert empty values like "" to the numeric value 0).
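For the fill-from-max+1 behaviour described in the question, a two-pass gawk sketch along these lines could work (untested, like the above; it assumes the ID sits in column 4 of a plain comma-separated file with a header row, and input.csv / output.csv are placeholder names, so adjust them to your layout):
gawk "BEGIN{ FS=OFS=\",\" } NR==FNR{ if($4+0>max) max=$4+0; next } FNR==1{ print; next } { if($4==\"\") $4=++max; print }" input.csv input.csv > output.csv
The file is read twice: the first pass (NR==FNR) only remembers the highest existing ID in max; the second pass prints the header unchanged and replaces every empty column 4 with max+1, max+2, and so on, leaving rows that already have an ID untouched.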
I have a spreadsheet (.csv) with 20 columns and 9000 rows. I want to delete rows that have the same ID in column 5, so I will end up with only one entry (or row) per ID number (unique ID). If there are 2 or more rows with the same ID in column 5, I want to keep the one that has the highest score in column 13. At the same time I want to keep all 20 columns for each row (all the information). Rows with repeated ID and lower score are not important, so I want to just remove those.
I was trying with awk and perl, but somehow I only managed to do it halfway. Let me know if I need to provide more information. Thanks!
INPUT (delimiter=','):
geneID, Score, annotation, etc.
ENSG0123, 532.0, intergenic, etc.
ENSG0123, 689.4, 3-UTR, etc.
ENSG0123, 234.0, 5-UTR, etc.
ENSG0399, 567.8, 5-UTR, etc.
OUTPUT:
geneID, Score, annotation, etc.
ENSG0123, 689.4, 3-UTR, etc.
ENSG0399, 567.8, 5-UTR, etc.
Since you didn't give a complete input/output example, I'll treat it as a generic problem. So here is the answer:
sort -t',' -k5,5n -k13,13nr file.csv|awk -F, '!a[$5]++'
Although awk can do it alone, with the help of sort the code is much easier. What the above one-liner does:
sorts the file by col5 (numerically) and col13 (numerically, descending)
passes the sorted result to awk to remove duplicates based on col5; since the rows for each col5 value arrive highest-score-first, the first (kept) row is the one with the highest col13.
Here is a little test of it; in the example, col1 plays the role of your col5, and col3 of your col13:
kent$ cat f
1,2,3
2,8,7
1,2,4
1,4,5
2,2,8
1,3,6
2,2,9
1,2,10
LsyHP 12:38:04 /tmp/test
kent$ sort -t',' -k1,1n -k3,3nr f|awk -F, '!a[$1]++'
1,2,10
2,2,9
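If you want to skip sort entirely, an awk-only sketch in the same spirit could look like this (it keeps, per col5 value, the row with the highest col13 score; the header is assumed to be line 1, and the END loop prints the surviving rows in no particular order):
awk -F, 'NR==1 { print; next }                     # keep the header line
         !($5 in best) || $13+0 > score[$5] {      # first row for this ID, or a better score
             best[$5]  = $0                        # remember the whole row
             score[$5] = $13+0 }
         END { for (id in best) print best[id] }' file.csv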
I have a tab-delimited file with three columns like this:
joe W 4
bob A 1
ana F 1
roy J 3
sam S 0
don R 2
tim L 0
cyb M 0
I want to sort this file by decreasing values in the third column, but to break ties I do not want to use some other column to do so (i.e. not use the first column to sort rows with the same entry in the third column).
Instead, I want rows with the same third column entries to either preserve the original order, or be sorted randomly.
Is there a way to do this using the sort command in unix?
sort -k3 -r -s file
This should give you the required output.
-k3 starts the sort key at the 3rd column, -r sorts in decreasing order, and -s makes the sort stable, which stops sort from breaking ties using the rest of the line, so tied rows keep their original input order. (The third-column values here are single digits, so a plain lexical sort is fine; for bigger numbers add the numeric flag, e.g. -k3,3nr.)
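On the sample above, that should give (rows with equal third-column values keep their original relative order):
joe W 4
roy J 3
don R 2
bob A 1
ana F 1
sam S 0
tim L 0
cyb M 0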
I have a list that has an ID, population, area and province, that looks like this:
1:517000:405212:Newfoundland and Labrador
2:137900:5660:Prince Edward Island
3:751400:72908:New Brunswick
4:938134:55284:Nova Scotia
5:7560592:1542056:Quebec
6:12439755:1076359:Ontario
7:1170300:647797:Manitoba
8:996194:651036:Saskatchewan
9:3183312:661848:Alberta
10:4168123:944735:British Columbia
11:42800:1346106:Northwest Territories
12:31200:482443:Yukon Territories
13:29300:2093190:Nunavut
I need to display the names of the provinces with the lowest and highest population density (population/area). How can I divide the population column by the area column (to 2 decimal places) but keep the file information intact on either side (e.g. 1:1.28:Newfoundland and Labrador)? After that I figure I can just pump it into sort -t: -nk2 | head -n 1 and sort -t: -nrk2 | head -n 1 to pull them out.
The recommended command given was grep.
Since you seem to have the sorting and extraction under control, here's an example awk script that should work for you:
#!/usr/bin/env awk -f
BEGIN {
    FS=":"          # input fields are separated by colons
    OFS=":"         # keep colons between output fields
    OFMT="%.2f"     # print the computed density with 2 decimal places
}
{
    print $1, $2/$3, $4   # ID : population/area : province name
}