How to group by two fields using bash scripting? - linux

Here is an example of one line of the log:
2016-04-24 23:59:45 -1 6bd3fbb8-65ac-4d16-bf32-48659a76c499 2 +15173583107 14 +161760555935 14 de.xxxx-O2 layxxxd 0 1
I know how to group by one field, so this is the solution:
awk '{arr[$11]+=$12} END {for (i in arr) {print i,arr[i]}}' example.log
and these would be the results:
xx 144
layxxxd 49.267
My question is: how can I group by two fields instead of one, where the first is $11 and the second is $10? The results should then change to:
layxxxd unknown 100
layxxxd de.xxxx-O2 44

how can I group by two fields instead of one, first should be $11 and second is $10?
You can use $11 FS $10 as the key for the associative array:
awk '{arr[$11 FS $10] += $12} END {for (i in arr) {print i,arr[i]}}' example.log
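Why FS in the key? The separator keeps the two fields distinct inside the key (so "ab"+"c" and "a"+"bc" don't collide), and the key then prints back out with that same separator between the two fields. If the fields could themselves contain spaces, a safer variant (a sketch, not part of the original answer) is awk's comma subscript, which joins the parts with the built-in SUBSEP and can be split back apart for printing:
awk '{arr[$11,$10] += $12} END {for (i in arr) {split(i, p, SUBSEP); print p[1], p[2], arr[i]}}' example.log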

Related

Splitting file based on first column's first character and length

I want to split a .txt into two, with one file having all lines where the first column's first character is "A" and the total of characters in the first column is 6, while the other file has all the rest. Searching led me to find the awk command and ways to separate files based on the first character, but I couldn't find any way to separate it based on column length.
I'm not familiar with awk, so what I tried (to no avail) was awk -F '|' '$1 == "A*****" {print > ("BeginsWithA.txt"); next} {print > ("Rest.txt")}' FileToSplit.txt.
Any help or pointers to the right direction would be very appreciated.
EDIT: As RavinderSingh13 reminded, it would be best for me to put some samples/examples of input and expected output.
So, here's an input example:
#FileToSplit.txt#
2134|Line 1|Stuff 1
31516784|Line 2|Stuff 2
A35646|Line 3|Stuff 3
641|Line 4|Stuff 4
A48029|Line 5|Stuff 5
A32100|Line 6|Stuff 6
413|Line 7|Stuff 7
What the expected output is:
#BeginsWith6.txt#
A35646|Line 3|Stuff 3
A48029|Line 5|Stuff 5
A32100|Line 6|Stuff 6
#Rest.txt#
2134|Line 1|Stuff 1
31516784|Line 2|Stuff 2
641|Line 4|Stuff 4
413|Line 7|Stuff 7
What you want to do is use a regex and the length function. I will leave it to you to set the field separator to match your input. Given your description, you could do:
awk '/^A/ && length($1) == 6 { print > "file_a.txt"; next } { print > "file_b.txt" }' file
This takes the information in file and, if the first field begins with "A" and is 6 characters in length, writes the record to file_a.txt; otherwise the record is written to file_b.txt (adjust names as needed).
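Since the sample added to the question is pipe-delimited, the same idea with the separator and the question's file names filled in would be (a sketch, checked only against the shown sample):
awk -F'|' '/^A/ && length($1) == 6 { print > "BeginsWith6.txt"; next } { print > "Rest.txt" }' FileToSplit.txt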
A non-regex awk solution:
awk -F'|' '{print $0>(index($1,"A")==1 && length($1)==6 ? "file_a.txt" : "file_b.txt")}' file
With your shown samples, could you please try the following. Since your shown samples do not start with A, I have not added that logic here; this solution also makes sure the 1st field is 6 digits long, as per the shown samples.
awk -F'|' '$1~/^[0-9]+$/ && length($1)==6{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' Input_file
2nd solution: in case your 1st field starts with A followed by 5 digits (which you state, but which is not in your shown samples), then try the following.
awk -F'|' '$1~/^A[0-9]+$/ && length($1)==6{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' Input_file
Or (a better version of the above):
awk -F'|' '$1~/^A[0-9]{5}$/{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' Input_file
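With the FileToSplit.txt sample from the question, the 2nd solution can be sanity-checked like this; the matching lines land in BeginsWith6.txt exactly as in the expected output:
$ awk -F'|' '$1~/^A[0-9]{5}$/{print > ("BeginsWith6.txt");next} {print > ("rest.txt")}' FileToSplit.txt
$ cat BeginsWith6.txt
A35646|Line 3|Stuff 3
A48029|Line 5|Stuff 5
A32100|Line 6|Stuff 6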

Bash addition by the first column, if data in second column is the same

I have a list with delimiters |
40|192.168.1.2|user4
42|192.168.1.25|user2
58|192.168.1.55|user3
118|192.168.1.3|user11
67|192.168.1.25|user2
As you can see, I have the same IP in line 42|192.168.1.25|user2 and in line 67|192.168.1.25|user2. How can I combine these lines by adding the first column? Can you give me a solution using awk, with some examples?
I need in a result something like this:
40|192.168.1.2|user4
58|192.168.1.55|user3
109|192.168.1.25|user2
118|192.168.1.3|user11
As you can see, the numbers from the first column have been added together.
If you need the output in the same order as in the Input_file, then the following awk may help:
awk -F"|" '!c[$2,$3]++{val++;v[val]=$2$3} {a[$2,$3]+=$1;b[$2,$3]=$2 FS $3;} END{for(j=1;j<=val;j++){print a[v[j]] FS b[v[j]]}}' SUBSEP="" Input_file
Adding a non-one-liner form of the solution too:
awk -F"|" ' ##Set the field separator to pipe(|) for all lines of the Input_file.
!c[$2,$3]++{ ##If the key $2,$3 occurs for the first time in array c, then do the following.
val++; ##Increment variable val by 1 each time the cursor comes here.
v[val]=$2$3 ##Record the key $2$3 (2nd and 3rd fields) in array v, indexed by val, to remember input order.
} ##Close the first-occurrence block here.
{
a[$2,$3]+=$1; ##Accumulate the 1st field value into array a, keyed by $2,$3, to build the running SUM.
b[$2,$3]=$2 FS $3; ##Store $2 FS $3 in array b under the same key, where FS is the field separator.
} ##Close this block here.
END{ ##Start the END block of the awk code here.
for(j=1;j<=val;j++){ ##Loop j from 1 to val, walking the keys in first-seen order.
print a[v[j]] FS b[v[j]] ##Print the sum and the stored fields for the j-th first-seen key.
}}
' SUBSEP="" Input_file ##Set SUBSEP to NULL (so the key $2,$3 equals $2$3) and name the Input_file.
Short GNU datamash + awk solution:
datamash -st'|' -g2,3 sum 1 <file | awk -F'|' '{print $3,$1,$2}' OFS='|'
-s - sort the input (grouping requires records with the same key to be adjacent)
-t'|' - use | as the field delimiter
-g2,3 - group by the 2nd and 3rd field (i.e. by IP address and user id)
sum 1 - sum the 1st field values within grouped records
The output:
40|192.168.1.2|user4
109|192.168.1.25|user2
118|192.168.1.3|user11
58|192.168.1.55|user3
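For reference, the trailing awk only reorders the columns; the intermediate output of the datamash step alone would be:
$ datamash -st'|' -g2,3 sum 1 <file
192.168.1.2|user4|40
192.168.1.25|user2|109
192.168.1.3|user11|118
192.168.1.55|user3|58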
Modifying the sample data to include different users for ip address 192.168.1.25:
$ cat ipfile
40|192.168.1.2|user4
42|192.168.1.25|user1 <=== same ip, different user
58|192.168.1.55|user3
118|192.168.1.3|user11
67|192.168.1.25|user9 <=== same ip, different user
And a simple awk script:
$ awk '
BEGIN { FS="|" ; OFS="|" }
{ sum[$2]+=$1 ; if (user[$2]=="") { user[$2]=$3 } }
END { for (idx in sum) { print sum[idx],idx,user[idx] } }
' ipfile
58|192.168.1.55|user3
40|192.168.1.2|user4
118|192.168.1.3|user11
109|192.168.1.25|user1 <=== captured first user id
BEGIN { FS="|" ; OFS="|" } : define input and output field separators; executed once at beginning
sum[$2]+=$1 : store/add field #1 to array (indexed by ip address == field #2); executed once for each row in data file
if .... : if a user hasn't already been stored for a given ip address, then store it now; this has the effect of saving the first user id we find for a given ip address; executed once for each row in data file
END { for .... / print ...} : loop through array indexes, printing our sum, ip address and (first) user id; executed once at the end
NOTE: No sorting requirement was provided in the original question; sorting could be added as needed ...
awk to the rescue!
$ awk 'BEGIN {FS=OFS="|"}
{a[$2 FS $3]+=$1}
END {for(k in a) print a[k],k}' file | sort -n
40|192.168.1.2|user4
58|192.168.1.55|user3
109|192.168.1.25|user2
118|192.168.1.3|user11
If user* is not part of the key and you want to capture the first value:
$ awk 'BEGIN {FS=OFS="|"}
{c[$2]+=$1;
if(!($2 in u)) u[$2]=$3} # capture first user
END {for(k in c) print c[k],k,u[k]}' file | sort -n
which ends up almost the same as @markp's answer.
Another idea along the same path, but one that allows for different users:
awk -F'|' '{c[$2] += $1}u[$2] !~ $3{u[$2] = (u[$2]?u[$2]",":"")$3}END{for(i in c)print c[i],i,u[i]}' OFS='|' input_file
If there are multiple users, they will be separated by commas.
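For example, with the modified ipfile shown earlier (user1 and user9 on 192.168.1.25), that address would come out as the line below; note that the for (i in c) order is unspecified, so line order may vary:
109|192.168.1.25|user1,user9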

counting string length before and after a match, line by line in bash or sed

I have a file 'test' of DNA sequences, each with a header or ID like so:
>new
ATCGGC
>two
ACGGCTGGG
>tre
ACAACGGTAGCTACTATACGGTCGTATTTTTT
I would like to print the length of each contiguous string before and after a match to a given string, e.g. CGG
The output would then look like this:
>new
2 1
>two
1 5
>tre
4 11 11
or could just have the character lengths before and after matches for each line.
2 1
1 5
4 11 11
My first attempts used sed to print the next line after finding '>', then found the byte offset for each grep match of "CGG", which I was going to use to convert to lengths, but this produced the following:
sed -n '/>/ {n;p}' test | grep -aob "CGG"
2:CGG
8:CGG
21:CGG
35:CGG
Essentially, grep is printing the byte offset for each match, counting up, while I want the byte offset for each line independently (i.e. resetting after each line).
I suppose I need to use sed for the search as well, as it operates line by line, but I'm not sure how to count the byte offset or characters in a given string.
Any help would be much appreciated.
By using your given string as the field separator in awk, it's as easy as iterating through the fields on each line and printing their lengths. (Lines starting with > we just print as they are.)
This gives the desired output for your sample data, though you'll probably want to check edge cases like starts with CGG, ends with CGG, only contains CGG, etc.; a sketch handling those follows the explanation below.
$ awk -F CGG '/^>/ {print; next} {for (i=1; i<=NF; ++i) {printf "%s%s", length($i), (i==NF)?"\n":" "}}' file.txt
>new
2 1
>two
1 5
>tre
4 11 11
awk -F CGG
Invoke awk using "CGG" as the field separator. This parses each line into a set of fields separated by each (if any) occurrence of the string "CGG". The "CGG" strings themselves are not included in any field.
Thus the line ACAACGGTAGCTACTATACGGTCGTATTTTTT is parsed into the three fields: ACAA, TAGCTACTATA, and TCGTATTTTTT, denoted in the awk program by $1, $2, and $3, respectively.
/^>/ {print; next}
This pattern/action tells awk that if the line starts with > to print the line and go immediately to the next line of input, without considering any further patterns or actions in the awk program.
{for (i=1; i<=NF; ++i) {printf "%s%s", length($i), (i==NF)?"\n":" "}}
If we arrive at this action, we know the line did not start with > (see above). Since there is only an action and no pattern, the action is executed for every line of input that arrives here.
The for loop iterates through all the fields (NF is a special awk variable that contains the number of fields in the current line) and prints their length. By checking if we've arrived at the last field, we know whether to print a newline or just a space.
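As the answer notes, edge cases need attention: with -F CGG, a sequence that starts or ends with CGG (or contains CGGCGG) produces empty fields, which print as length 0. A minimal sketch that skips those empty fields instead (how such inputs should be reported is an assumption, since the sample data contains none):
awk -F CGG '/^>/ {print; next}
{
  out = ""
  for (i = 1; i <= NF; ++i)
    if (length($i)) out = out (out == "" ? "" : " ") length($i)  # skip zero-length edge fields
  print out  # a line consisting only of matches prints as an empty line
}' file.txt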

Emacs: how to concatenate two rows together to form unique identifier? [duplicate]

Input where the identifier is specified by two rows (1-2):
L1_I L1_I C-14 <---| unique identifier
WWPTH WWPT WWPTH <---| on two rows
1 2 3
Goal: how to concatenate the rows?
L1_IWWPTH L1_IWWPT C-14WWPTH <--- unique identifier
1 2 3
P.s. I will accept the simplest and most elegant solution.
Assuming that the input is in a file called file:
$ awk 'NR==1{for (i=1;i<=NF;i++) a[i]=$i;next} NR==2{for (i=1;i<=NF;i++) printf "%-20s",a[i] $i;print"";next} 1' file
L1_IWWPTH L1_IWWPT C-14WWPTH
1 2 3
How it works
NR==1{for (i=1;i<=NF;i++) a[i]=$i;next}
For the first line, save all the column headings in the array a. Then, skip over the rest of the commands and jump to the next line.
NR==2{for (i=1;i<=NF;i++) printf "%-20s",a[i] $i;print"";next}
For the second line, print all the column headings, merging together the ones from the first and second rows. Then, skip over the rest of the commands and jump to the next line.
1
1 is awk's cryptic shorthand for print the line as is. This is done for all lines after the second.
Tab-separated columns with possible missing columns
If columns are tab-separated:
awk -F'\t' 'NR==1{for (i=1;i<=NF;i++) a[i]=$i;next} NR==2{for (i=1;i<=NF;i++) printf "%s\t",a[i] $i;print"";next} 1' file
If you plan to use python, you can use zip in the following way:
input = [['L1_I', 'L1_I', 'C-14'], ['WWPTH', 'WWPT', 'WWPTH'], [1, 2, 3]]
output = [[i + j for i, j in zip(input[0], input[1])]] + input[2:]
print(output)
output:
[['L1_IWWPTH', 'L1_IWWPT', 'C-14WWPTH'], [1, 2, 3]]
#!/usr/bin/awk -f
NR == 1 {                    # remember the fields of the first row
    split($0, a)
    next
}
NR == 2 {                    # merge first-row fields with second-row fields
    for (b = 1; b in a; b++) # numeric loop: "for (b in a)" gives no guaranteed order
        printf "%-20s", a[b] $b
    print ""
    next
}
1                            # print any remaining lines unchanged
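If the script is saved as, say, concat.awk (the name is arbitrary), it can be run as:
awk -f concat.awk file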

AWK epoch diff with current and previous line

I have a file like this called Sample:-
206,,,206,14.9,0,2012/04/24 00:00:05
206,,,206,14.9,0,2012/04/24 00:00:21
205,,,205,14.9,0,2012/04/24 00:00:23
205,,,205,14.9,0,2012/04/24 00:00:29
207,,,207,14.9,0,2012/04/24 00:00:34
205,,,205,14.9,0,2012/04/24 00:00:40
204,,,204,14.9,0,2012/04/24 00:00:46
202,,,202,14.9,0,2012/04/24 00:00:52
201,,,201,14.9,0,2012/04/24 00:01:00
202,,,202,14.9,0,2012/04/24 00:01:04
And the following AWK command:-
awk -F, '{ gsub("/"," ",$7); gsub(":"," ",$7); t+=(mktime($7)-mktime(p)); printf ("%s,%s,%s\n",mktime($7),mktime(p),t); p=$7 }' Sample
Giving the following output:-
1335222005,-1,1335222006
1335222021,1335222005,1335222022
1335222023,1335222021,1335222024
1335222029,1335222023,1335222030
1335222034,1335222029,1335222035
1335222040,1335222034,1335222041
1335222046,1335222040,1335222047
1335222052,1335222046,1335222053
1335222060,1335222052,1335222061
1335222064,1335222060,1335222065
For each line, the 7th column is converted to an epoch date, and the difference from the epoch date on the previous line is calculated and added to t.
On the first line being processed, because p does not yet hold a date, mktime returns -1, throwing out my figures.
What I want to do is tell the AWK script: if line 1 is being processed, assume the difference is 6. At the moment it is subtracting -1 from 1335222005, resulting in 1335222006.
I want to say, start t at 6, then on the second line, work out the difference in epoch seconds to the previous line and increment t by that amount.
You just need to do something special for line 1.
awk -F, '
{gsub(/[\/:]/," ",$7); this_time = mktime($7)}
NR != 1 {t += this_time - prev_time; print this_time, prev_time, t}
{prev_time = this_time}
' Sample
Given your input data, this prints
1335240021 1335240005 16
1335240023 1335240021 18
1335240029 1335240023 24
1335240034 1335240029 29
1335240040 1335240034 35
1335240046 1335240040 41
1335240052 1335240046 47
1335240060 1335240052 55
1335240064 1335240060 59
Alternatively, a convenient way to initialize a variable is with awk's -v option:
awk -v t=6 '... same as before ...'
In awk you can initialize a variable in a BEGIN block, and there exist two variables that give the line number, FNR and NR; both are useful for your case:
BEGIN { t = 6 }
or
FNR == 1 { t = 6 }
Would using a BEGIN block help?
That will allow initialization of the t variable to whatever you want. Something like:
awk -F, 'BEGIN {t=6} { gsub("/"," ",$7); gsub(":"," ",$7); t+=(mktime($7)-mktime(p)); printf ("%s,%s,%s\n",mktime($7),mktime(p),t); p=$7 }' Sample
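Note that BEGIN {t=6} alone still leaves the first-line problem in place, because t is then incremented by mktime($7) - (-1) when p is empty. A minimal sketch combining the BEGIN initialization with a first-line guard (assuming the file is named Sample, as in the question):
awk -F, '
BEGIN { t = 6 }                       # start t at 6, the assumed difference for line 1
{ gsub(/[\/:]/, " ", $7); this = mktime($7) }
NR > 1 { t += this - prev }           # only add a difference once a previous timestamp exists
{ print this, prev, t; prev = this }  # prev prints empty on the first line
' Sample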
