I have a large file containing data like this:
a 23
b 8
a 22
b 1
I want to be able to get this:
a 45
b 9
I can first sort this file and then do it in Python by scanning the file once. What is a good direct command-line way of doing this?
Edit: The modern (GNU/Linux) solution, as mentioned in the comments years ago ;-):
awk '{
arr[$1]+=$2
}
END {
for (key in arr) printf("%s\t%s\n", key, arr[key])
}' file \
| sort -k1,1
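Applied to the sample input above, this prints the sums (tab-separated), sorted by key:
a	45
b	9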
The originally posted solution, based on old Unix sort options:
awk '{
arr[$1]+=$2
}
END {
for (key in arr) printf("%s\t%s\n", key, arr[key])
}' file \
| sort +0n -1
I hope this helps.
No need for awk here, or even sort -- if you have Bash 4.0, you can use associative arrays:
#!/bin/bash
declare -A values
while read key value; do
values["$key"]=$(( $value + ${values[$key]:-0} ))
done
for key in "${!values[@]}"; do
printf "%s %s\n" "$key" "${values[$key]}"
done
...or, if you sort the file first (which will be more memory-efficient: GNU sort can do tricks to sort files larger than memory, which a naive script -- whether in awk, Python or shell -- typically can't), you can do this in a way that also works in older versions (I expect the following to work back to bash 2.0):
#!/bin/bash
read cur_key cur_value
while read key value; do
if [[ $key = "$cur_key" ]] ; then
cur_value=$(( cur_value + value ))
else
printf "%s %s\n" "$cur_key" "$cur_value"
cur_key="$key"
cur_value="$value"
fi
done
printf "%s %s\n" "$cur_key" "$cur_value"
This Perl one-liner seems to do the job:
perl -nle '($k, $v) = split; $s{$k} += $v; END {$, = " "; foreach $k (sort keys %s) {print $k, $s{$k}}}' inputfile
This can easily be achieved with the following one-liner:
cat /path/to/file | termsql "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"
Or:
termsql -i /path/to/file "SELECT col0, SUM(col1) FROM tbl GROUP BY col0"
This uses termsql, a Python package that wraps SQLite. Note that at the time of writing it was not uploaded to PyPI and could only be installed system-wide (its setup.py is a little broken), like:
pip install --user https://github.com/tobimensch/termsql/archive/master.zip
Update
In 2020 version 1.0 was finally uploaded to PyPI, so pip install --user termsql can be used.
One way using perl:
perl -ane '
next unless @F == 2;
$h{ $F[0] } += $F[1];
END {
printf qq[%s %d\n], $_, $h{ $_ } for sort keys %h;
}
' infile
Content of infile:
a 23
b 8
a 22
b 1
Output:
a 45
b 9
With GNU awk (versions less than 4):
WHINY_USERS= awk 'END {
for (E in a)
print E, a[E]
}
{ a[$1] += $2 }' infile
With GNU awk >= 4:
awk 'END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (E in a)
print E, a[E]
}
{ a[$1] += $2 }' infile
With a sort + awk combination one could try the following, which avoids creating an array.
sort -k1 Input_file |
awk '
prev!=$1 && prev{
print prev,(prevSum?prevSum:"N/A")
prev=prevSum=""
}
{
prev=$1
prevSum+=$2
}
END{
if(prev){
print prev,(prevSum?prevSum:"N/A")
}
}'
Explanation: a detailed explanation of the above.
sort -k1 Input_file | ##Using sort command to sort Input_file by 1st field and sending output to awk as an input.
awk ' ##Starting awk program from here.
prev!=$1 && prev{ ##Checking condition prev is NOT equal to first field and prev is NOT NULL.
print prev,(prevSum?prevSum:"N/A") ##Printing prev and prevSum (if it's NULL then print N/A).
prev=prevSum="" ##Nullify prev and prevSum here.
}
{
prev=$1 ##Assigning 1st field to prev here.
prevSum+=$2 ##Adding 2nd field to prevSum.
}
END{ ##Starting END block of this awk program from here.
if(prev){ ##Checking condition if prev is NOT NULL then do following.
print prev,(prevSum?prevSum:"N/A") ##Printing prev and prevSum (if it's NULL then print N/A).
}
}'
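On the sample data from the question this prints:
a 45
b 9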
Related
I need to search and replace a pattern in a file
[ec2_server]
server_host=something
[list_server]
server_host=old_name
to
[ec2_server]
server_host=something
[list_server]
server_host=new_name
I'm able to get it working with
awk '/\[list_server]/ { print; getline; $0 = "server_host=new_name" } 1' file.txt
But I'm trying to parameterize the search pattern, the parameter name to change and the parameter value to change.
PATTERN_TO_SEARCH=[list_server]
PARAM_NAME=server_host
PARAM_NEW_VALUE=new_name
But it is not working when I parameterize and pass the variables to awk
awk -v patt=$PATTERN_TO_SEARCH -v parm=$PARAM_NAME -v parmval=$PARAM_NEW_VALUE '/\patt/ { print; getline; $0 = "parm=parmval" } 1' file.txt
You have two instances of the same problem: you're trying to use a variable name inside a string value. Awk can't read your mind: it can't intuit that sometimes when you write "HOME" you mean "print the value of the variable HOME" and other times you mean "print the word HOME".
We need to make two separate changes:
First, to use a variable in your search pattern, you can use
syntax like this:
awk -v patt='some text' '$0 == patt {print}'
(Note that here we're using an equality match, ==; you can also use a regular expression match, ~, but in this particular case that would only complicate things).
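As a quick illustration of why (using the question's file.txt): with an unescaped regex match, [list_server] is a bracket expression matching any single one of those characters, so it would match almost every line:
awk -v patt='[list_server]' '$0 ~ patt {print}' file.txt
# prints all four lines, since each contains at least one of the characters l, i, s, t, _, e, r, v, n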
With your example file content, running:
awk -v patt='[list_server]' '$0 == patt {print}' file.txt
Produces:
[list_server]
Next, when you write $0 = "parm=parmval", you're setting $0 to the literal string parm=parmval. If you want to perform variable substitution, consider using sprintf():
awk \
-v patt="$PATTERN_TO_SEARCH" \
-v parm="$PARAM_NAME" \
-v parmval="$PARAM_NEW_VALUE"\
'
$0 == patt { print; getline; $0 = sprintf("%s=%s", parm, parmval) } 1
' file.txt
Which gives us:
[ec2_server]
server_host=something
[list_server]
server_host=new_name
Have your awk code the following way. Experts recommend not to use getline (since it has edge cases in its use), so I am going with: find the string, then set a flag (a custom variable made by me in the program), and then print the line accordingly, using a regex along with the value passed in from the shell variable.
Along with matching and printing the new value, we also need to set the field separator to fetch the correct value and replace/print it with the new one, so I made the field separator = for the whole Input_file. With this approach you need not pass any variable holding the server_host value, since it is already present in the Input_file and we can take it from there.
First, an awk solution with the value mentioned within an awk variable itself, checked against a regex in the main program of awk:
awk -v var="list_server" -v newVal="NEW_VALUE" '
BEGIN{ FS=OFS="=" }
$0 ~ "^\\[" var "\\]$"{
found=1
print
next
}
found{
print $1 OFS newVal
found=""
next
}
1
' Input_file
Or, an awk solution that gets the value from a shell variable and then uses a regex inside awk to match the condition:
varS="list_server" ##Shell variable
newvalue="NEW_VALUE" ##Shell variable
awk -v var="$varS" -v newVal="$newvalue" '
BEGIN{ FS=OFS="=" }
$0 ~ "^\\[" var "\\]$"{
found=1
print
next
}
found{
print $1 OFS newVal
found=""
next
}
1
' Input_file
$ awk -v pat="$PATTERN_TO_SEARCH" -v parm="$PARAM_NAME" -v parmval="$PARAM_NEW_VALUE" '
f{$0=parm"="parmval; f=0} $0==pat{f=1} 1
' file
[ec2_server]
server_host=something
[list_server]
server_host=new_name
This makes the assumption that "${PARAM_NAME}" immediately follows the search pattern row:
_P2S_='[list_server]'
_PNM_='server_host'
_PNV_='new_name'
echo "${...input...}" | gtee >( gpaste - | gcat -b >&2; echo ) | gcat - |
{m,n,g}awk -v __="${_P2S_}=${_PNM_}=${_PNV_}" -F= 'BEGIN {
$(_-=_)=__;___= $(_ = NF); FS ="^"(OFS = $--_ FS)
__= $-(_+=-_--) } (NR-_)< NF ? ($NF =___)^(_-=_) :_=NR*(-!!_)^(__!=$!_)' |
gcat -b | gcat -n | ecp
1 [ec2_server]
2 server_host=something
3 [list_server]
4 server_host=old_name
1 1 [ec2_server]
2 2 server_host=something
3
4 3 [list_server]
5 4 server_host=new_name
Recently, I had to sort several files according to records' ID; the catch was that there can be several types of records, and in each of those the field I had to use for sorting is on a different position. The fields, however, are easily identifiable thanks to key=value structure. To show a simple sample of the general structure:
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
I came up with a pipeline as follows, which did the job:
awk -F'[|=]' '{for(i=1; i<=NF; i++) {if($i ~ "id") {i++; print $i"?"$0} }}' tester.txt | sort -n | awk -F'?' '{print $2}'
In other words the algorithm is as follows:
Split the record by both field and key-value separators (| and =)
Iterate through the elements and search for the id key
Print the next element (value of id key), a separator, and the whole line
Sort numerically
Remove prepended identifier to preserve records' structure
Processing the sample gives the output:
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
Is there a way, though, to do this task using single awk command?
You may try this gnu-awk code to do this in a single command:
awk -F'|' '{
for(i=1; i<=NF; ++i)
if ($i ~ /^id=/) {
a[gensub(/^id=/, "", 1, $i)] = $0
break
}
}
END {
PROCINFO["sorted_in"] = "#ind_num_asc"
for (i in a)
print a[i]
}' file
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
We use | as the field delimiter, and when a column starts with id= we store the full record in array a, with the text after = as the index.
Setting PROCINFO["sorted_in"] = "@ind_num_asc" makes the for loop traverse array a in ascending numeric index order, so printing the values gives the sorted output.
Using GNU awk for the 3rd arg to match() and sorted_in:
$ cat tst.awk
match($0,/(^|\|)id=([0-9]+)/,a) {
ids2vals[a[2]] = $0
}
END {
PROCINFO["sorted_in"] = "#ind_num_asc"
for ( id in ids2vals ) {
print ids2vals[id]
}
}
$ awk -f tst.awk file
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3
Try Perl: perl -e 'print map { s/^.*? //; $_ } sort { $a <=> $b } map { ($id) = /id=(\d+)/; "$id $_" } <>' file
Some explanation of the code I use:
print #print the resulting list of lines
map {
s/^.*? //;
$_
} #remove numeric id from start of line
sort { $a <=> $b } #sort numerically
map {
($id) = /id=(\d+)/;
"$id $_"
} # capture id and place it in start of line
<> # read all lines from file
Or try sed and sort: sed 's/^\(.*id=\([0-9][0-9]*\).*\)$/\2 \1/' file | sort -n | sed 's/^[^ ][^ ]* //'
With your shown samples only, please try the following (awk + sort + cut) solution. It is written and tested in GNU awk but should work in any awk.
awk '
match($0,/id=[0-9]+/){
print substr($0,RSTART,RLENGTH)";"$0
}
' Input_file | sort -t'=' -k2n | cut -d';' -f2-
Explanation: a detailed explanation of the above code.
awk ' ##Starting awk program from here.
match($0,/id=[0-9]+/){ ##Using awk's match function to match id= followed by digits.
print substr($0,RSTART,RLENGTH)";"$0 ##Printing the matched substring, then a semicolon, then the current line.
}
' Input_file | ##Mentioning Input_file here and passing awk output as standard input to the next command.
sort -t'=' -k2n | ##Sorting the output numerically by the 2nd field, with = as the delimiter, then passing it to the next command.
cut -d';' -f2- ##Using cut with ; as the delimiter to print everything from the 2nd field onwards.
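Running it on the sample records yields the same sorted output as above:
fieldD=valueD|recordType=B|id=1|fieldE=valueE
fieldA=valueA|fieldB=valueB|recordType=A|id=2|fieldC=valueC
fieldF=valueF|fieldG=valueG|fieldH=valueH|recordType=C|id=3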
I have a scenario where I want to hash some columns of a CSV file. How do I do that with the below data?
ID|NAME|CITY|AGE
1|AB1|BBC|12
2|AB2|FGD|17
3|AB3|ASD|18
4|AB4|SDF|19
5|AB5|ASC|22
The columns NAME and AGE should get hashed with random-looking values, like the output below:
ID|NAME|CITY|AGE
1|68b329da9111314099c7d8ad5cb9c940|BBC|77bAD9da9893er34099c7d8ad5cb9c940
2|69b32fga9893e34099c7d8ad5cb9c940|FGD|68bAD9da989yue34099c7d8ad5cb9c940
3|46b329da9893e3403453d8ad5cb9c940|ASD|60bfgD9da9893e34099c7d8ad5cb9c940
4|50Cd29da9893e34099c7d8ad5cb9c940|SDF|67bAD9da98973e34099c7d8ad5cb9c940
5|67bAD9da9893e34099c7d8ad5cb9c940|ASC|67bAD9da11893e34099c7d8ad5cb9c940
When I tested it, the code below gives me the same value for every row of the column 'NAME'; it should give randomized values:
awk '{
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' < sample.csv
output :
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
You may use it like this:
awk 'function hash(s, cmd, hex, line) {
cmd = "openssl md5 <<< \"" s "\""
if ( (cmd | getline line) > 0)
hex = line
close(cmd)
return hex
}
BEGIN {
FS = OFS = "|"
}
NR == 1 {
print
next
}
{
print $1, hash($2), $3, hash($4)
}' file
ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|2737b49252e2a4c0fe4c342e92b13285
2|157aa4a48373eaf0415ea4229b3d4421|FGD|4d095eeac8ed659b1ce69dcef32ed0dc
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|cf4278314ef8e4b996e1b798d8eb92cf
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|3bb50ff8eeb7ad116724b56a820139fa
5|427872b1ac3a22dc154688ddc2050516|ASC|2fc57d6f63a9ee7e2f21a26fa522e3b6
You have to specify | as input and output field separators. Otherwise $2 is not what you expect, but an empty string.
awk -F '|' -v "OFS=|" 'FNR==1 { print; next } {
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' sample.csv
prints
ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|12
2|157aa4a48373eaf0415ea4229b3d4421|FGD|17
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|18
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|19
5|427872b1ac3a22dc154688ddc2050516|ASC|22
Example using GNU datamash to do the hashing and some awk to rearrange the columns it outputs:
$ datamash -t'|' --header-in -f md5 2,4 < input.txt | awk 'BEGIN { FS=OFS="|"; print "ID|NAME|CITY|AGE" } { print $1, $5, $3, $6 }'
ID|NAME|CITY|AGE
1|1109867462b2f0f0470df8386036243c|BBC|c20ad4d76fe97759aa27a0c99bff6710
2|14da3a611e2f8953d76b6fb7866b01d1|FGD|70efdf2ec9b086079795c442636b55fb
3|710a24b9eac0692b1adaabd07726211a|ASD|6f4922f45568161a8cdf4ad2299f6d23
4|c4d15b255ef3c6a89d1fe2e6a26b8eda|SDF|1f0e3dad99908345f7439f8ffabdffc4
5|96b24a28173a75cc3c682e25d3a6bd49|ASC|b6d767d2f8ed5d21a44b0e5886680cb9
Note that the MD5 hashes in this answer are different from (at the time of writing) the ones in the others; that's because those answers use approaches that add a trailing newline to the strings being hashed, producing incorrect results if you want the exact hash:
$ echo AB1 | md5sum
d44aec35a11ff6fa8a800120dbef1cd7 -
$ echo -n AB1 | md5sum
1109867462b2f0f0470df8386036243c -
You might consider using a language that has md5 support included, or at least cache the md5 results (I assume that the name and age have a limited domain, which is smaller than the number of lines).
Perl has support for md5 out of the box:
perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) {
$F[$_] = md5_hex($F[$_]) for (1,3);
print join "|",#F
} else { print }'
online demo: https://ideone.com/xg6cxZ (to my surprise ideone has perl available in bash)
Digest::MD5 is a core module, any perl installation should have it
-M'Digest::MD5 qw(md5_hex)' - this loads the md5_hex function
-l handles line endings
-F'\|' - autosplit fields on | (this implies -a and -n)
2..eof - range operator (or flip-flop as some want to call it) - true between line 2 and end of the file
$F[$_] = md5_hex($F[$_]) - replace field $_ with its md5 sum
for (1,3) - statement modifier runs the statement for 1 and 3 aliasing $_ to them
print join "|",#F - print the modified fields
else { print } - this hanldes the header
Note about speed: on my machine this processes ~100,000 lines in about 100 ms, compared with an awk variant of this answer that does 5000 lines in ~1 minute 14 seconds (I wasn't patient enough to wait for 100,000 lines):
time perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) { $F[$_] = md5_hex($F[$_]) for (1,3);print join "|",@F } else { print }' <sample2.txt > out4.txt
real 0m0.121s
user 0m0.118s
sys 0m0.003s
$ time awk -F'|' -v OFS='|' -i md5.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >out2.txt
real 1m14.205s
user 0m50.405s
sys 0m35.340s
md5.awk defines the md5 function as such:
$ cat md5.awk
function md5(str, cmd, l, hex) {
cmd= "/bin/echo -n "str" | openssl md5 -r"
if ( ( cmd | getline l) > 0 )
hex = substr(l,1,32)
close(cmd)
return hex
}
I'm using /bin/echo because there are some variants of shell where echo doesn't have -n
I'm using -n mostly because I want to be able to compare the results with the perl results
substr(l,1,32) - awk strings are 1-indexed, so the start position is 1; on my machine openssl md5 doesn't return just the sum, it also includes the file name - see: https://ideone.com/KGMWPe - substr gets only the relevant part
I'm using a separate file because it seems much cleaner, and because I can switch between function implementations fairly easy
As I was saying in the beginning, if you really want to use awk, at least cache the result of the openssl tool.
$ cat md5memo.awk
function md5(str, cmd, l, hex) {
if (cache[str])
return cache[str]
cmd= "/bin/echo -n "str" | openssl md5 -r"
if ( ( cmd | getline l) > 0 )
hex = substr(l,1,32)
close(cmd)
cache[str] = hex
return hex
}
With the above caching, the results improve dramatically:
$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >outmemo.txt
real 0m0.192s
user 0m0.141s
sys 0m0.085s
[savuso@localhost hash]$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <sample2.txt >outmemof.txt
real 0m0.281s
user 0m0.222s
sys 0m0.088s
However, your mileage may vary: sample2.txt has 100,000 lines, with 5 different values for $2 and 40 different values for $4. Real-life data may vary!
Note: I just realized that my awk implementation doesn't handle headers, but you can get that from the other answers.
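A minimal sketch of how that could look with the cached version, borrowing the FNR == 1 pass-through used in the other answers:
awk -F'|' -v OFS='|' -i md5memo.awk 'FNR==1 { print; next } { print $1,md5($2),$3,md5($4) }' sample2.txt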
I have coded the following lines:
ARRAY=($(awk 'FS = ";" {print $3}' file.txt))
LINE_CREATOR=`echo "aaaa;bbbb;cccccccc" |
'{awk -F";"};
END
for (i in ARRAY)
{
print $'${ARRAY['i']}'
}
}'`
The file.txt looks like:
1;8;3
4;6;1
7;9;2
Explanation:
the array contains the values 3 1 2,
so the loop will iterate over the array and extract fields $3, $1, $2 from "aaaa;bbbb;cccccccc" using awk,
and the final output should be this:
ccccccccaaaabbbb
I still have some errors while launching my script.
I'm making a few guesses here but I think that this does what you want:
$ echo "aaaa;bbbb;cccccccc" | awk -F\; 'NR == FNR { n = split($0, a); next }
{ printf "%s", a[$3] } END { print "" }' - file
ccccccccaaaabbbb
NR == FNR means that the block is only run for the first input. - as an argument tells awk to read first from standard input. The string is split on FS (;) into the array a. next skips the rest of the script.
The second block is only run for the second input (the text file). The values in the third field are used to print the elements in the array a.
If you want to pass the indices as an awk variable, here is another way:
$ awk -F';' -v ix="$(cut -d\; -f3 file | paste -sd\;)" '
BEGIN{n=split(ix,a)}
{for(i=1;i<n;i++) printf "%s",$a[i];
printf "%s\n",$a[n]}' <<< "aaaa;bbbb;cccccccc"
ccccccccaaaabbbb
I have implemented a function that searches a column in a file for a string, and it works well. What I would like to know is: how do I modify it to search all the columns for a string?
awk -v s=$1 -v c=$2 '$c ~ s { print $0 }' $3
Thanks
If "all the columns" means "the entire file" then:
grep "$string" "$file"
Here is an example of one way to modify your current script to search for two different strings in two different columns. You can extend it to work for as many as you wish, however for more than a few it would be more efficient to do it another way.
awk -v s1="$1" -v c1="$2" -v s2="$3" -v c2="$4" '$c1 ~ s1 || $c2 ~ s2 { print $0 }' "$5"
As you can see, this technique won't scale well.
Another technique treats the column numbers and strings as a file and should scale better:
awk 'FNR == NR {strings[++c] = $1; columns[c] = $2; next}
{
for (i = 1; i <= c; i++) {
if ($columns[i] ~ strings[i]) {
print
}
}
}' < <(printf '%s %d\n' "${searches[@]}") inputfile
The array ${searches[@]} should contain strings and column numbers, alternating.
There are several ways to populate ${searches[@]}. Here's one:
#!/bin/bash
# (this is bash and should precede the AWK above in the script file)
unset searches
for arg in "${@:1:$#-1}"
do
searches+=("$arg")
shift
done
inputfile=$1 # the last remaining argument
# now the AWK stuff goes here
To run the script, you'd do this:
$ ./scriptname foo 3 bar 7 baz 1 filename
awk -v pat="$string" '$0 ~ pat' infile
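This matches pat against the entire record ($0), i.e. against every column at once.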