tr command in awk to change the column values - linux

I am using the tr command inside awk in my shell script to mask data. In the example below, the tr command in awk affects only the first line of my file. When I use the same command in a while loop and call awk inside it, it works fine, but it takes a very long time to complete. Now my requirement: I want to mask many columns (for example $1, $5, $9) in the same file (file.txt), the change should affect the whole file, not just the first line, and the masking should run as fast as possible. Please advise.
cat file.txt
========
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek,lskjsjshsh
abcbchs,degehek
abcbchs,degehek,lskjsjshsh
OUTPUT
awk -F"," -v OFS="," '{ "echo \""$1"\" | tr \"a-c\" \"e-f\" | tr \"0-5\" \"6-9\"" | getline $1 }7' file.txt
effffhs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek
abcbchs,degehek,lskjsjshsh
abcbchs,degehek
abcbchs,degehek,lskjsjshsh
Expected output
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek,lskjsjshsh
effffhs,degehek
effffhs,degehek,lskjsjshsh

The code you found runs an external shell command pipeline on each input line. As you discovered, that's an awfully inefficient way to do what you are asking. Awk isn't really an ideal choice for this task at all; maybe try Perl.
perl -F, -lane 'for (0, 4, 8) { $F[$_] =~ tr/a-c/e-f/; $F[$_] =~ tr/0-5/6-9/ } print join(",", @F)' file
The -F, option is like Awk's, but Perl doesn't automatically split the input line. With -a it does, splitting into an array named @F, and with -n it loops over all input lines. The -l is a convenience that removes the newline from each input line and adds one back when you print.
Notice how the columns are numbered from zero, not one as in Awk; so the indices in the for loop access the first, fifth, and ninth elements of @F.
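If you'd rather stay in awk after all, the transliteration can be done natively, with no per-line external processes: build the tr-style character map once in BEGIN and apply it per character. A minimal sketch (mask.awk is a name I made up; the mapping and the field list 1, 5, 9 mirror the question):
$ cat mask.awk
BEGIN {
    FS = OFS = ","
    # tr-style map built once: a-c -> e-f (f repeated), 0-5 -> 6-9 (9 repeated)
    from = "abc012345"
    to   = "eff678999"
    for (i = 1; i <= length(from); i++)
        map[substr(from, i, 1)] = substr(to, i, 1)
    nf = split("1 5 9", flds, " ")      # the columns to mask
}
{
    for (j = 1; j <= nf; j++) {
        f = flds[j]
        if (f <= NF) {                  # skip columns this line doesn't have
            masked = ""
            for (i = 1; i <= length($f); i++) {
                c = substr($f, i, 1)
                masked = masked (c in map ? map[c] : c)
            }
            $f = masked
        }
    }
    print
}
$ awk -f mask.awk file.txt
On the sample file above this prints the expected output (abcbchs becomes effffhs), and since the masking happens in-process it stays fast even on large files.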

You forgot to close() the command after every invocation. Here's the correct way to write it:
$ cat tst.awk
BEGIN { FS=OFS="," }
{
cmd="echo '" $1 "' | tr 'a-c' 'e-f' | tr '0-5' '6-9'"
$1 = ( (cmd | getline line) > 0 ? line : $1 )
close(cmd)
print
}
$ awk -f tst.awk file
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek
effffhs,degehek,lskjsjshsh
effffhs,degehek
effffhs,degehek,lskjsjshsh
You also didn't protect yourself from getline failures, hence the extra complexity around the getline call; see http://awk.info/?tip/getline.
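To see why the missing close() masked only the first line: the command string built from $1 is identical for every one of those repeated input lines, so the pipe is opened once, the first getline consumes its only line of output, and every later getline on the same string reads from the same exhausted pipe and returns 0, leaving $1 untouched. A small demonstration:
$ printf 'x\nx\n' | awk '{ cmd = "echo hi"; print NR ": " ((cmd | getline v) > 0 ? v : "getline failed") }'
1: hi
2: getline failed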
Given your comments, this shows how to modify multiple fields (1, 3, and 5 in this case) simultaneously:
$ cat tst.awk
BEGIN { FS=OFS="," }
{
cmd = "echo '" $0 "' | tr 'a-c' 'e-f' | tr '0-5' '6-9'"
new = ( (cmd | getline line) > 0 ? line : $0 )
close(cmd)
split(new,tmp)
for (i in tmp) {
if (i ~ /^(1|3|5)$/) {
$i = tmp[i]
}
}
print
}
$ cat file
abc,abc,abc,abc,abc
abc,abc,abc,abc,abc,abc,abc
abc,abc,abc,abc,abc,abc
abc,abc,abc,abc
$ awk -f tst.awk file
eff,abc,eff,abc,eff
eff,abc,eff,abc,eff,abc,abc
eff,abc,eff,abc,eff,abc
eff,abc,eff,abc
To handle quotes in the input data:
$ cat tst.awk
BEGIN { FS=OFS="," }
{
gsub(/'/,SUBSEP)
cmd = "echo '" $0 "' | tr 'a-c' 'e-f' | tr '0-5' '6-9'"
new = ( (cmd | getline line) > 0 ? line : $0 )
close(cmd)
split(new,tmp)
for (i in tmp) {
if (i ~ /^(1|3|5)$/) {
$i = tmp[i]
}
}
gsub(SUBSEP,"'")
print
}
$ cat file
a'c,abc,a"c,abc,abc
abc,a'c,abc,a"c,abc,abc,abc
abc,abc,abc,abc,abc,abc
abc,abc,abc,abc
$ awk -f tst.awk file
e'f,abc,e"f,abc,eff
eff,a'c,eff,a"c,eff,abc,abc
eff,abc,eff,abc,eff,abc
eff,abc,eff,abc
If you don't have any particular control char that's guaranteed not to appear in your input, you can create a non-existent string to use instead of SUBSEP above by using the technique described at the end of https://stackoverflow.com/a/29237745/1745001
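For illustration only (this is not the code behind that link), one cheap way to build such a placeholder is to scan the input once for a control character it never contains:
$ cat findchar.awk
{
    lines[NR] = $0
    for (i = 1; i <= length($0); i++)
        seen[substr($0, i, 1)]
}
END {
    for (c = 1; c <= 31; c++) {            # try \001 .. \037
        ph = sprintf("%c", c)
        if (!(ph in seen))
            break                          # ph never appears in the input
    }
    for (n = 1; n <= NR; n++) {
        s = lines[n]
        gsub(/'/, ph, s)                   # s is now safe to single-quote for a shell command
        print s
    }
}
A real version would then run the command and map ph back to ' afterwards, just as the SUBSEP variant above does.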

Related

How to hash a particular column in a csv file | linux

I have a scenario where I want to hash some columns of a csv file. How do I do that with the data below?
ID|NAME|CITY|AGE
1|AB1|BBC|12
2|AB2|FGD|17
3|AB3|ASD|18
4|AB4|SDF|19
5|AB5|ASC|22
The columns NAME and AGE should get hashed with random values, like the output below:
ID|NAME|CITY|AGE
1|68b329da9111314099c7d8ad5cb9c940|BBC|77bAD9da9893er34099c7d8ad5cb9c940
2|69b32fga9893e34099c7d8ad5cb9c940|FGD|68bAD9da989yue34099c7d8ad5cb9c940
3|46b329da9893e3403453d8ad5cb9c940|ASD|60bfgD9da9893e34099c7d8ad5cb9c940
4|50Cd29da9893e34099c7d8ad5cb9c940|SDF|67bAD9da98973e34099c7d8ad5cb9c940
5|67bAD9da9893e34099c7d8ad5cb9c940|ASC|67bAD9da11893e34099c7d8ad5cb9c940
When I tested the code below, it gave me the same value in every row of the column 'NAME'; it should give randomized values.
awk '{
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' < sample.csv
output :
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
You may use it like this:
awk 'function hash(s, cmd, hex, line) {
cmd = "openssl md5 <<< \"" s "\""
if ( (cmd | getline line) > 0)
hex = line
close(cmd)
return hex
}
BEGIN {
FS = OFS = "|"
}
NR == 1 {
print
next
}
{
print $1, hash($2), $3, hash($4)
}' file
ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|2737b49252e2a4c0fe4c342e92b13285
2|157aa4a48373eaf0415ea4229b3d4421|FGD|4d095eeac8ed659b1ce69dcef32ed0dc
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|cf4278314ef8e4b996e1b798d8eb92cf
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|3bb50ff8eeb7ad116724b56a820139fa
5|427872b1ac3a22dc154688ddc2050516|ASC|2fc57d6f63a9ee7e2f21a26fa522e3b6
You have to specify | as input and output field separators. Otherwise $2 is not what you expect, but an empty string.
awk -F '|' -v "OFS=|" 'FNR==1 { print; next } {
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' sample.csv
prints
ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|12
2|157aa4a48373eaf0415ea4229b3d4421|FGD|17
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|18
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|19
5|427872b1ac3a22dc154688ddc2050516|ASC|22
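Incidentally, the repeated 68b329da9893e34099c7d8ad5cb9c940 in the question is the md5 of a bare newline: with the default whitespace FS the whole line is one field, $2 is empty, and echo "" feeds just "\n" to openssl. Compare:
$ echo '1|AB1|BBC|12' | awk '{print "<" $2 ">"}'
<>
$ echo '1|AB1|BBC|12' | awk -F'|' '{print "<" $2 ">"}'
<AB1>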
Example using GNU datamash to do the hashing and some awk to rearrange the columns it outputs:
$ datamash -t'|' --header-in -f md5 2,4 < input.txt | awk 'BEGIN { FS=OFS="|"; print "ID|NAME|CITY|AGE" } { print $1, $5, $3, $6 }'
ID|NAME|CITY|AGE
1|1109867462b2f0f0470df8386036243c|BBC|c20ad4d76fe97759aa27a0c99bff6710
2|14da3a611e2f8953d76b6fb7866b01d1|FGD|70efdf2ec9b086079795c442636b55fb
3|710a24b9eac0692b1adaabd07726211a|ASD|6f4922f45568161a8cdf4ad2299f6d23
4|c4d15b255ef3c6a89d1fe2e6a26b8eda|SDF|1f0e3dad99908345f7439f8ffabdffc4
5|96b24a28173a75cc3c682e25d3a6bd49|ASC|b6d767d2f8ed5d21a44b0e5886680cb9
Note that the MD5 hashes in this answer are different from the ones in the other answers (at the time of writing); that's because those answers use approaches that add a trailing newline to the strings being hashed, producing incorrect results if you want the exact hash:
$ echo AB1 | md5sum
d44aec35a11ff6fa8a800120dbef1cd7 -
$ echo -n AB1 | md5sum
1109867462b2f0f0470df8386036243c -
You might consider using a language that has md5 support included, or at least cache the md5 results (I assume that the name and age have a limited domain, which is smaller than the number of lines).
Perl has support for md5 out of the box:
perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) {
$F[$_] = md5_hex($F[$_]) for (1,3);
print join "|",#F
} else { print }'
online demo: https://ideone.com/xg6cxZ (to my surprise ideone has perl available in bash)
Digest::MD5 is a core module, any perl installation should have it
-M'Digest::MD5 qw(md5_hex)' - this loads the md5_hex function
-l handle line endings
-F'\|' - autosplit fields on | (this implies -a and -n)
2..eof - range operator (or flip-flop as some want to call it) - true between line 2 and end of the file
$F[$_] = md5_hex($F[$_]) - replace field $_ with its md5 sum
for (1,3) - statement modifier runs the statement for 1 and 3 aliasing $_ to them
print join "|",#F - print the modified fields
else { print } - this handles the header
Note about speed: on my machine this processes ~100,000 lines in about 100 ms, compared with an awk variant of this answer that does 5,000 lines in ~1 minute 14 seconds (I wasn't patient enough to wait for 100,000 lines)
time perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) { $F[$_] = md5_hex($F[$_]) for (1,3);print join "|",#F } else { print }' <sample2.txt > out4.txt
real 0m0.121s
user 0m0.118s
sys 0m0.003s
$ time awk -F'|' -v OFS='|' -i md5.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >out2.txt
real 1m14.205s
user 0m50.405s
sys 0m35.340s
md5.awk defines the md5 function as such:
$ cat md5.awk
function md5(str, cmd, l, hex) {
cmd= "/bin/echo -n "str" | openssl md5 -r"
if ( ( cmd | getline l) > 0 )
hex = substr(l,1,32)
close(cmd)
return hex
}
I'm using /bin/echo because there are some variants of shell where echo doesn't have -n
I'm using -n mostly because I want to be able to compare the results with the perl results
substr(l,1,32) - on my machine openssl md5 doesn't return just the sum, it also includes the file name - see: https://ideone.com/KGMWPe - substr gets only the relevant part (a sample of the raw output is shown just after these notes)
I'm using a separate file because it seems much cleaner, and because I can switch between function implementations fairly easy
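For reference, the raw -r output being trimmed looks like this on my machine (the digest, then a name, which is *stdin when reading standard input):
$ echo -n AB1 | openssl md5 -r
1109867462b2f0f0470df8386036243c *stdin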
As I was saying in the beginning, if you really want to use awk, at least cache the result of the openssl tool.
$ cat md5memo.awk
function md5(str, cmd, l, hex) {
if (cache[str])
return cache[str]
cmd= "/bin/echo -n "str" | openssl md5 -r"
if ( ( cmd | getline l) > 0 )
hex = substr(l,1,32)
close(cmd)
cache[str] = hex
return hex
}
With the above caching, the results improve dramatically:
$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >outmemo.txt
real 0m0.192s
user 0m0.141s
sys 0m0.085s
[savuso#localhost hash]$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <sample2.txt >outmemof.txt
real 0m0.281s
user 0m0.222s
sys 0m0.088s
however your mileage may vary: sample2.txt has 100,000 lines, with 5 different values for $2 and 40 different values for $4. Real-life data may vary!
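sample2.txt itself isn't shown anywhere; purely as an illustration, a generator for data of that shape (100,000 rows, 5 distinct $2 values, 40 distinct $4 values, no header line to match the timing runs above) could look like:
$ awk 'BEGIN {
    srand(1)
    for (i = 1; i <= 100000; i++)
        printf "%d|AB%d|XYZ|%d\n", i, 1 + int(rand()*5), 10 + int(rand()*40)
}' > sample2.txt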
Note: I just realized that my awk implementation doesn't handle headers, but you can get that from the other answers

awk add string to each line except last blank line

I have a file with a blank line at the end. I need to add a suffix to each line except the last blank line.
I use:
awk '$0=$0"suffix"' | sed 's/^suffix$//'
But maybe it can be done without sed?
UPDATE:
I want to skip all lines that contain only the '\n' symbol.
EXAMPLE:
I have file test.tsv:
a\tb\t1\n
\t\t\n
c\td\t2\n
\n
I run cat test.tsv | awk '$0=$0"\t2"' | sed 's/^\t2$//':
a\tb\t1\t2\n
\t\t\t2\n
c\td\t2\t2\n
\n
It sounds like this is what you need:
awk 'NR>1{print prev "suffix"} {prev=$0} END{ if (NR) print prev (prev == "" ? "" : "suffix") }' file
The test for NR in the END is to avoid printing a blank line given an empty input file. It's untested, of course, since you didn't provide any sample input/output in your question.
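That guard is easy to verify: the END block runs even when there was no input at all, with NR still 0:
$ printf '' | awk 'END{print "NR=" NR}'
NR=0
$ printf 'x\n' | awk 'END{print "NR=" NR}'
NR=1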
To treat all empty lines the same:
awk '{print $0 (/./ ? "suffix" : "")}' file
#try:
awk 'NF{print $0 "suffix"}' Input_file
this will skip all blank lines
awk 'NF{$0=$0 "suffix"}1' file
to only skip the last line if blank
awk 'NR>1{print p "suffix"} {p=$0} END{print p (NF?"suffix":"") }' file
If perl is okay:
$ cat ip.txt
a b 1
c d 2
$ perl -lpe '$_ .= "\t 2" if !(eof && /^$/)' ip.txt
a b 1 2
2
c d 2 2
$ # no blank line for empty file as well
$ printf '' | perl -lpe '$_ .= "\t 2" if !(eof && /^$/)'
$
-l strips newline from input, adds back when line is printed at end of code due to -p option
eof to check end of file
/^$/ blank line
$_ .= "\t 2" append to input line
Try this -
$ cat f ###Blank line only in the end of file
-11.2
hello
$ awk '{print (/./?$0"suffix":"")}' f
-11.2suffix
hellosuffix
$
OR
$ cat f ####blank line in middle and end of file
-11.2
hello
$ awk -v val=$(wc -l < f) '{print (/./ || NR!=val?$0"suffix":"")}' f
-11.2suffix
suffix
hellosuffix
$

Is it possible to pipe a print statement in awk to multiple text files?

I think the question speaks for itself. I have two text files: file1 and file2. Here is a sample code with awk inside a bash script:
EDIT: I am using gnu awk
My Script:
val=3
awk 'if ("'$val'" == "3")
print "Hello" >> "'$PWD/file1.txt'"
else
print "Goodbye" #append to file1.txt and file2.txt
'
I don't want something like this:
val=3
awk 'if ("'$val'" == "3")
print "Hello" >> "'$PWD/file1.txt'"
else {
print "Goodbye" >> "'$PWD/file1.txt'"
print "Goodbye" >> "'$PWD/file2.txt'"
}'
I know that in bash you can use tee to pipe to multiple files. Can it be used in gnu awk? If so then how? Is there another way in gnu awk?
The GNU Awk manual shows an example of how to simulate tee with awk. That might be a good starting point.
The basic idea is to store the various output file names in an array and then loop over that array to send the output to each file in turn. In your case, something like this (I'm typing directly into SO -- you have to adapt/fix it according to your needs, of course):
BEGIN {
output[0] = "'$PWD/file1.txt'"
output[1] = "'$PWD/file2.txt'"
...
}
{
for (i in output)
print "Goodbye!" >> output[i]
}
I know that in bash you can use tee to pipe to multiple files. Can it be used in gnu awk?
If a non-awk-only solution is acceptable, another option is to redirect some file descriptor to tee in the outer bash script and then send output to that fd from awk when required. Here is a simple example:
#!/bin/bash
exec 4<> >(tee file1.txt file2.txt)
awk '{ print NR; # send only to stdout
print "READ:" $0 >> "/dev/fd/4"; # send to `tee`
}'
That produces:
sh$ (echo a; echo b) | ./a.sh
1
2
READ:a
READ:b
sh$ cat file1.txt
READ:a
READ:b
sh$ cat file2.txt
READ:a
READ:b
Your awk script is wrong in the way it accesses the values of shell variables, and you're putting the whole script in the condition section, so you will get undesirable side effects if not outright syntax errors.
Your script:
val=3
awk 'if ("'$val'" == "3")
print "Hello" >> "'$PWD/file1.txt'"
else
print "Goodbye" #append to file1.txt and file2.txt
'
should instead have been written as:
val=3
awk -v val="$val" -v pwd="${PWD}/" '{
if (val == 3)
print "Hello" >> (pwd "file1.txt")
else
print "Goodbye" #append to file1.txt and file2.txt
}'
to be syntactically correct. Expanding it to print to multiple files gives:
val=3
awk -v val="$val" -v pwd="${PWD}/" '{
if (val == 3) {
print "Hello" >> (pwd "file1.txt")
}
else {
print "Goodbye" >> (pwd "file1.txt")
print "Goodbye" >> (pwd "file2.txt")
}
}'
Or:
val=3
awk -v val="$val" -v pwd="${PWD}/" '
BEGIN { split("file1 file2",files) }
{
if (val == 3)
print "Hello" >> (pwd "file1.txt")
else
for (f in files)
print "Goodbye" >> (pwd files[f] ".txt")
}'

Linux - parsing data, what language to use

I am looking to parse data out of a 'column'-based format. I am running into issues where I feel I am 'hacking' bash/awk commands together to pull out the strings and numbers. If the numbers/text come in different formats, the script might fail unexpectedly and I will get errors.
Data:
RSSI (dBm): -86 Tx Power: 0
RSRP (dBm): -114 TAC: 4r5t (12341)
RSRQ (dB): -10 Cell ID: efefwg (4261431)
SINR (dB): 2.2
My method:
Using bash and awk
#!/bin/bash
DATA_OUTPUT=$(get_data)
RSSI=$(echo "${DATA_OUTPUT}" | awk '$1 == "RSSI" {print $3}')
RSRP=$(echo "${DATA_OUTPUT}" | awk '$1 == "RSRP" {print $3}')
RSRQ=$(echo "${DATA_OUTPUT}" | awk '$1 == "RSRQ" {print $3}')
SINR=$(echo "${DATA_OUTPUT}" | awk '$1 == "SINR" {print $3}')
TX_POWER=$(echo "${DATA_OUTPUT}" | awk '$4 == "Tx" {print $6}')
echo "$SINR"
echo ">$SINR<"
However, the output of the above comes out very strange.
2.2 # that's fine!
<2.2 # what??? expecting >2.2<
Little things like this make me question using awk and bash to parse the data. Should I use C++ or some other language? Or is there a better way of doing this?
Thank you
This should be your starting point (the match() can be simplified or removed if your input data is tab-separated or uses fixed-width fields):
$ cat file
RSSI (dBm): -86 Tx Power: 0
RSRP (dBm): -114 TAC: 4r5t (12341)
RSRQ (dB): -10 Cell ID: efefwg (4261431)
SINR (dB): 2.2
$ cat tst.awk
{
tail = $0
while ( match(tail,/[^:]+:[[:space:]]+[^[:space:]]+[[:space:]]*([^[:space:]]*$)?/) )
{
nvPair = substr(tail,RSTART,RLENGTH)
sub(/ \([^)]+\):/,":",nvPair) # remove (dB) or (dBm)
sub(/:[[:space:]]+/,":",nvPair) # remove spaces after :
sub(/[[:space:]]+$/,"",nvPair) # remove trailing spaces
split(nvPair,tmp,/:/)
name2value[tmp[1]] = tmp[2] # name2value["RSSI"] = "-86"
tail = substr(tail,RSTART+RLENGTH)
}
}
END {
for (name in name2value) {
value = name2value[name]
printf "%s=\"%s\"\n", name, value
}
}
$ awk -f tst.awk file
Tx Power="0"
RSSI="-86"
TAC="4r5t (12341)"
Cell ID="efefwg (4261431)"
RSRP="-114"
RSRQ="-10"
SINR="2.2"
Hopefully it's clear that in the above script after the match() loop you can simply say things like print name2value["Tx Power"] to print the value of that key phrase.
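For instance, a hypothetical END block you could swap in for the one above to pull out just two of the parsed values instead of dumping them all:
END {
    printf "SINR=%s, Tx Power=%s\n", name2value["SINR"], name2value["Tx Power"]
}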
If your data was created in DOS, run dos2unix or tr -d '^M' on it first, where ^M means a literal control-M character.
Your data contains DOS-style \r\n line endings. When you do this
echo ">$SINR<"
the output is actually
>2.2\r<
The carriage return sends the cursor back to the start of the line.
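You can reproduce the overwriting effect directly in a terminal:
$ printf '>%s\r<\n' 2.2
<2.2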
You can do this:
DATA_OUTPUT=$(get_data | sed 's/\r$//')
But instead of parsing the output over and over, I'd rewrite it like this:
while read -ra fields; do
case ${fields[0]} in
RSSI) rssi=${fields[2]};;
RSRP) rsrp=${fields[2]};;
RSRQ) rsrq=${fields[2]};;
SINR) sinr=${fields[2]};;
esac
if [[ ${fields[3]} == "Tx" ]]; then tx_power=${fields[5]}; fi
done < <(get_data | sed 's/\r$//' )

awk save command output to variable

I need to execute a command per line of some file. For example:
file1.txt 100 4
file2.txt 19 8
So my awk script needs to execute something like
command $1 $2 $3
and save the output of command $1 $2 $3, so system() will not work and neither will getline. (I can't pipe the output if I do something like this.)
The restriction in this problem is to use only awk. (I already had a solution with a bash script + awk... but I only want awk... just to know more about this.)
What's wrong with using getline?
$ ./test.awk test.txt
# ls -F | grep test
test.awk*
test.txt
# cat test.txt | nl
1 ls -F | grep test
2 cat test.txt | nl
3 cat test.awk
# cat test.awk
#!/usr/bin/awk -f
{
cmd[NR] = $0
while (($0 | getline line) > 0) output[NR] = output[NR] line RS
close($0)    # close the pipe; also guards against re-running an identical command
}
END {
for (i in cmd) print "# " cmd[i] ORS output[i]
}
Awk's system() function passes the string to /bin/sh, so you can use redirect operators, like ">file.out" if you want.
awk '{system("command " $1 " " $2 " " $3 ">" $1 ".out");}'
Edit: OK, by save, you mean into an awk variable. ephemient is on the right track, then. That's what awk's getline does, like backticks or $(cmd) in shell/perl. In fact, googling for awk backticks found this:
http://www.omnigroup.com/mailman/archive/macosx-admin/2006-May/054665.html
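The backtick equivalent in awk boils down to exactly that cmd | getline pattern plus a close(); a minimal self-contained sketch:
$ awk 'BEGIN {
    cmd = "date"                         # any shell command line
    while ((cmd | getline line) > 0)     # read its stdout line by line
        out = out line RS
    close(cmd)                           # lets the same command be run again later
    printf "%s", out                     # out now holds what var=$(date) would in shell
}'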
You say you can't use getline because then you couldn't pipe. But you can work around that with tee and file-descriptor tricks. This works if /bin/sh is bash:
{ "set +o posix; command " $1 " " $2 " " $3 " | tee >(grep foo)" | getline var; print toupper(var); } # bash-only, and broken.
set +o posix is necessary because awk runs bash as sh, which makes it go into posix mode after reading its startup files. Hmm, I'm not having any luck getting that to work, and it requires bash anyway.
Ok, this works:
$ touch foo bar
$ echo "foo bar" |
awk '{ "{ ls " $1 " " $2 " " $3 " | tee /dev/fd/10 | grep foo > /dev/tty; } 10>&1" | getline var; print toupper(var); }'
foo
BAR
