I have a scenario where I want to hash some columns of a CSV file.
How can I do that with the data below?
ID|NAME|CITY|AGE
1|AB1|BBC|12
2|AB2|FGD|17
3|AB3|ASD|18
4|AB4|SDF|19
5|AB5|ASC|22
The NAME and AGE columns should be hashed, so that the values look randomized, like the output below:
ID|NAME|CITY|AGE
1|68b329da9111314099c7d8ad5cb9c940|BBC|77bAD9da9893er34099c7d8ad5cb9c940
2|69b32fga9893e34099c7d8ad5cb9c940|FGD|68bAD9da989yue34099c7d8ad5cb9c940
3|46b329da9893e3403453d8ad5cb9c940|ASD|60bfgD9da9893e34099c7d8ad5cb9c940
4|50Cd29da9893e34099c7d8ad5cb9c940|SDF|67bAD9da98973e34099c7d8ad5cb9c940
5|67bAD9da9893e34099c7d8ad5cb9c940|ASC|67bAD9da11893e34099c7d8ad5cb9c940
When I tested the code below, it gives the same value for every row of the NAME column; it should give different values:
awk '{
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' < sample.csv
Output:
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
68b329da9893e34099c7d8ad5cb9c940
You may use it like this:
awk 'function hash(s, cmd, hex, line) {
cmd = "openssl md5 <<< \"" s "\""
if ( (cmd | getline line) > 0)
hex = line
close(cmd)
return hex
}
BEGIN {
FS = OFS = "|"
}
NR == 1 {
print
next
}
{
print $1, hash($2), $3, hash($4)
}' file
ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|2737b49252e2a4c0fe4c342e92b13285
2|157aa4a48373eaf0415ea4229b3d4421|FGD|4d095eeac8ed659b1ce69dcef32ed0dc
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|cf4278314ef8e4b996e1b798d8eb92cf
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|3bb50ff8eeb7ad116724b56a820139fa
5|427872b1ac3a22dc154688ddc2050516|ASC|2fc57d6f63a9ee7e2f21a26fa522e3b6
You have to specify | as input and output field separators. Otherwise $2 is not what you expect, but an empty string.
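That also explains the output in the question: the repeated value is simply the MD5 of the newline that echo appends to the empty string, as you can verify (the label in front of the hash varies with the OpenSSL version):
$ echo "" | openssl md5
(stdin)= 68b329da9893e34099c7d8ad5cb9c940
With the separators set: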
awk -F '|' -v "OFS=|" 'FNR==1 { print; next } {
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' sample.csv
prints
ID|NAME|CITY|AGE
1|d44aec35a11ff6fa8a800120dbef1cd7|BBC|12
2|157aa4a48373eaf0415ea4229b3d4421|FGD|17
3|ba3c08d4a65f1baa1d7220a6802b5710|ASD|18
4|69be622e1c0d417ceb9b8fb0aa9dc574|SDF|19
5|427872b1ac3a22dc154688ddc2050516|ASC|22
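The question wants AGE hashed as well; a sketch applying the same command pipe to both field 2 (NAME) and field 4 (AGE):
awk -F '|' -v "OFS=|" 'FNR==1 { print; next } {
for (i = 2; i <= 4; i += 2) {   # fields 2 (NAME) and 4 (AGE)
tmp="echo " $i " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$i=cksum
}
print
}' sample.csv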
Example using GNU datamash to do the hashing and some awk to rearrange the columns it outputs:
$ datamash -t'|' --header-in -f md5 2,4 < input.txt | awk 'BEGIN { FS=OFS="|"; print "ID|NAME|CITY|AGE" } { print $1, $5, $3, $6 }'
ID|NAME|CITY|AGE
1|1109867462b2f0f0470df8386036243c|BBC|c20ad4d76fe97759aa27a0c99bff6710
2|14da3a611e2f8953d76b6fb7866b01d1|FGD|70efdf2ec9b086079795c442636b55fb
3|710a24b9eac0692b1adaabd07726211a|ASD|6f4922f45568161a8cdf4ad2299f6d23
4|c4d15b255ef3c6a89d1fe2e6a26b8eda|SDF|1f0e3dad99908345f7439f8ffabdffc4
5|96b24a28173a75cc3c682e25d3a6bd49|ASC|b6d767d2f8ed5d21a44b0e5886680cb9
Note that the MD5 hashes in this answer differ (at the time of writing) from the ones in the other answers; that's because those approaches add a trailing newline to the strings being hashed, which produces the wrong digest if you want the hash of the exact value:
$ echo AB1 | md5sum
d44aec35a11ff6fa8a800120dbef1cd7 -
$ echo -n AB1 | md5sum
1109867462b2f0f0470df8386036243c -
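If your echo does not support -n, printf is the portable way to hash a value without the trailing newline:
$ printf '%s' AB1 | md5sum
1109867462b2f0f0470df8386036243c  -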
You might consider using a language that has MD5 support built in, or at least cache the MD5 results (I assume the hashed columns have a limited domain, smaller than the number of lines).
Perl has support for md5 out of the box:
perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) {
$F[$_] = md5_hex($F[$_]) for (1,3);
print join "|",#F
} else { print }'
online demo: https://ideone.com/xg6cxZ (to my surprise ideone has perl available in bash)
Digest::MD5 is a core module, any perl installation should have it
-M'Digest::MD5 qw(md5_hex)' - this loads the md5_hex function
-l handles line endings
-F'\|' - autosplit fields on | (this implies -a and -n)
2..eof - range operator (or flip-flop as some want to call it) - true between line 2 and end of the file
$F[$_] = md5_hex($F[$_]) - replace field $_ with its MD5 sum
for (1,3) - statement modifier runs the statement for 1 and 3 aliasing $_ to them
print join "|",#F - print the modified fields
else { print } - this handles the header
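Run on the sample file from the question (assuming it is saved as sample.csv), it prints the same newline-free hashes as the datamash answer above:
$ perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) { $F[$_] = md5_hex($F[$_]) for (1,3); print join "|", @F } else { print }' sample.csv
ID|NAME|CITY|AGE
1|1109867462b2f0f0470df8386036243c|BBC|c20ad4d76fe97759aa27a0c99bff6710
2|14da3a611e2f8953d76b6fb7866b01d1|FGD|70efdf2ec9b086079795c442636b55fb
3|710a24b9eac0692b1adaabd07726211a|ASD|6f4922f45568161a8cdf4ad2299f6d23
4|c4d15b255ef3c6a89d1fe2e6a26b8eda|SDF|1f0e3dad99908345f7439f8ffabdffc4
5|96b24a28173a75cc3c682e25d3a6bd49|ASC|b6d767d2f8ed5d21a44b0e5886680cb9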
Note about speed: on my machine this processes ~100,000 lines in about 100 ms, compared with an awk variant of this answer that takes ~1 minute 14 seconds for just 5,000 lines (I wasn't patient enough to wait for 100,000 lines):
time perl -M'Digest::MD5 qw(md5_hex)' -F'\|' -le 'if (2..eof) { $F[$_] = md5_hex($F[$_]) for (1,3);print join "|",@F } else { print }' <sample2.txt > out4.txt
real 0m0.121s
user 0m0.118s
sys 0m0.003s
$ time awk -F'|' -v OFS='|' -i md5.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >out2.txt
real 1m14.205s
user 0m50.405s
sys 0m35.340s
md5.awk defines the md5 function as such:
$ cat md5.awk
function md5(str, cmd, l, hex) {
cmd= "/bin/echo -n "str" | openssl md5 -r"
if ( ( cmd | getline l) > 0 )
hex = substr(l,1,32)   # awk's substr is 1-indexed; take exactly the 32 hex digits
close(cmd)
return hex
}
I'm using /bin/echo because there are some variants of shell where echo doesn't have -n
I'm using -n mostly because I want to be able to compare the results with the perl results
substr(l,1,32) - on my machine openssl md5 doesn't return just the sum, it also includes the file name - see: https://ideone.com/KGMWPe - substr gets only the relevant part (awk's substr is 1-indexed, so the hash is characters 1 through 32)
I'm using a separate file because it seems much cleaner, and because I can switch between function implementations fairly easily
As I was saying in the beginning, if you really want to use awk, at least cache the result of the openssl tool.
$ cat md5memo.awk
function md5(str, cmd, l, hex) {
if (cache[str])
return cache[str]
cmd= "/bin/echo -n "str" | openssl md5 -r"
if ( ( cmd | getline l) > 0 )
hex = substr(l,1,32)
close(cmd)
cache[str] = hex
return hex
}
With the above caching, the results improve dramatically:
$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <(head -5000 sample2.txt) >outmemo.txt
real 0m0.192s
user 0m0.141s
sys 0m0.085s
[savuso@localhost hash]$ time awk -F'|' -v OFS='|' -i md5memo.awk '{ print $1,md5($2),$3,md5($4) }' <sample2.txt >outmemof.txt
real 0m0.281s
user 0m0.222s
sys 0m0.088s
However, your mileage may vary: sample2.txt has 100,000 lines, with 5 different values for $2 and 40 different values for $4. Real-life data may vary!
Note: I just realized that my awk implementation doesn't handle headers, but you can get that from the other answers
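For instance, borrowing the FNR==1 trick from the other answers (a sketch, with md5memo.awk as above):
$ awk -F'|' -v OFS='|' -i md5memo.awk 'FNR==1 { print; next } { print $1,md5($2),$3,md5($4) }' sample2.txt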
I have a Perl script that accepts a comma-separated CSV file as input.
I would like to discard the last column (the column number is known in advance).
The problem is that the last column may contain quoted strings with commas, in which case I would like to cut the entire quoted string.
Example:
colA,colB,colC
1,2,3
4,5,"6,6"
What I would like to end up with is:
colA,colB
1,2
4,5
The current solution I have is using Linux cut command in the following manner:
cat $file | cut -d ',' -f 3 --complement
Which outputs the following:
colA,colB
1,2
4,5,6"
Which works great unless the last column is a quoted string with commas in it.
I can only use native Perl/Linux commands to solve this.
Appreciate your help
Using Text::CSV, as a script to process STDIN into STDOUT:
use strict;
use warnings;
use Text::CSV 'csv';
my $csv = csv(in => \*STDIN, keep_headers => \my @headers,
auto_diag => 2, encoding => 'UTF-8');
pop @headers;
csv(in => $csv, out => \*STDOUT, headers => \@headers,
auto_diag => 2, encoding => 'UTF-8');
The obvious benefit of this approach is handling all common edge cases automatically.
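Assuming the script is saved as drop_last_col.pl (the name is arbitrary), you would run it as:
perl drop_last_col.pl < input.csv > output.csv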
Try this, based on awk's FPAT regex field matching:
awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=',' '{print $1,$2}' ${file}
Example
echo '"4,4",5,"6,6"' | awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=',' '{print $1,$2}'
"4,4",5
If quoted strings with commas are the only trouble you are facing, you can use this:
$ sed -E 's/,"[^"]*"$|,[^,]*$//' ip.txt
colA,colB
1,2
4,5
,"[^"]*"$ will match , followed by " followed by non " characters followed by " at the end of line
,[^,]*$ will match , followed by non , characters at end of line
The double quoted column will match earlier in the string and thus gets deleted completely
Equivalent for perl would be perl -lpe 's/,"[^"]*"$|,[^,]*$//' ip.txt
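To see why the quoted alternative is needed, compare against the plain pattern on the tricky line:
$ echo '4,5,"6,6"' | sed -E 's/,[^,]*$//'
4,5,"6
$ echo '4,5,"6,6"' | sed -E 's/,"[^"]*"$|,[^,]*$//'
4,5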
I believe sungtm's answer is correct and requires some explanation:
awk -v FPAT='([^,]+)|(\"[^\"]+\")' -v OFS=',' '{print $1,$2}'
Is equivalent to:
script.awk
BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")"; # gnu awk specific: FPAT is RegEx pattern to identify the field's content
# [^,]+ ------ RegEx pattern to match all chars not ","
#"[^\"]+\" ------ RegEx pattern to match all quated chars including the quotes
#()|() ------ RegEx optional groups selector
OFS = ","; # Output field separator
}
{ # for each input line/record
print $1, $2; # print "1st field" OFS value "2nd field"
}
Running:
awk -f script.awk input.txt
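One caveat: since both alternatives require at least one character, this FPAT does not recognize empty fields:
$ echo 'a,,b' | awk -v FPAT='([^,]+)|(\"[^\"]+\")' '{ print NF }'
2
If empty fields can occur, the first alternative needs to be [^,]* instead (with edge cases of its own).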
Save the script in any file, say script.pl.
Execute it as: perl script.pl /opt/filename.csv
"1","2,3",4,"test, test" ==> "1","2,3",4
1,"2,3,4","5 , 6","7,8" ==> 1,"2,3,4","5 , 6"
0,0,0,"test" ==> 0,0,0
It handles the cases above:
use strict;
if (scalar(@ARGV) != 1 ) {
print "usage: perl script.pl absolute_file_path";
exit;
}
my $filename = $ARGV[0]; # complete file path here
open(DATA, '<', $filename)
or die "Could not open file '$filename' $!";
my @lines = <DATA>;
close(DATA);
my $counter=0;
open my $fo, '>', $filename;
foreach my $line (@lines) {
chomp($line);
my @update = split '(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)' , $line;
my @update2;
foreach (@update) {
if($_=~/\w+/) {
push(@update2,$_);
}
}
pop(@update2);
my $str = join(',',@update2);
print $fo "$str";
unless (++$counter == scalar(@lines)) {
print $fo "\n";
}
}
close $fo;
Well, this case is quite interesting - please see my solution below.
You can change $debug = 1; to see what happens and how this mechanism works:
use strict;
use warnings;
my $debug = 0;
while( <DATA> ) {
print "IN: $_" if $debug;
chomp;
s/"(.+?)"/replace($1)/ge; # do magic replacement , -> ___ in block of interest
print "REP: $_\n" if $debug;
my @data = split /,/; # split into array
pop @data; # pop last element of array
my $line = join ',', @data; # merge array into a string
$line =~ s/___/,/g; # do unmagic replacement
$line =~ s/\|/"/g; # restore | -> "
printf "%s$line\n", $debug ? "OUT: " : ''; # print result
}
sub replace {
my $line = shift;
$line =~ s/,/___/g; # do magic replacement in our block
return "|$line|"; # put | arount block of interest
}
__DATA__
colA,colB,colC
1,2,3
4,5,"6,6"
8,3,"1,2",37,82
64,12,"1,2,3,4",42,56
"3,4,7,8",2,8,"8,7,6,5,4",2,8
"3,4,7,8",2,8,"8,7,6,5,4",2,8,"2,8,4,1"
"3,4,7,8",2,8,"8,7,6,5,4",2,8,"2,8,4,1",3,4
Appreciate your help. Below is the solution I ended up using:
cat file.csv | perl -MText::ParseWords -nle '@f = parse_line(",",2, $_); tr/,/$/d for @f; print join ",", @f' | cut -d ',' -f 3 --complement | tr $ , ;
This replaces commas inside quoted fields with the $ sign, which is translated back to a comma after the unwanted last column has been discarded (it assumes no field contains a literal $).
Say I have a string:
random text before authentication_token = 'pYWastSemJrMqwJycZPZ', gravatar_hash = 'd74a97f
I want a shell command to extract everything after "authentication_token = '" and before the next '.
So basically, I want to return pYWastSemJrMqwJycZPZ.
How do I do this?
Use parameter expansion:
#!/bin/bash
text="random text before authentication_token = 'pYWastSemJrMqwJycZPZ', gravatar_hash = 'd74a97f"
token=${text##* authentication_token = \'} # Remove the left part.
token=${token%%\'*} # Remove the right part.
echo "$token"
Note that it works even if the random text before the token contains another authentication_token = '...' occurrence: the ## expansion removes the longest matching prefix, so matching starts at the last occurrence.
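Echoing $token between the two expansions shows what each step leaves behind:
$ echo "$token" # after removing the left part
pYWastSemJrMqwJycZPZ', gravatar_hash = 'd74a97f
$ echo "$token" # after removing the right part
pYWastSemJrMqwJycZPZ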
If your grep supports -P then you could use this PCRE regex,
$ echo "random text before authentication_token = 'pYWastSemJrMqwJycZPZ', gravatar_hash = 'd74a97f" | grep -oP "authentication_token = '\K[^']*"
pYWastSemJrMqwJycZPZ
$ echo "random text before authentication_token = 'pYWastSemJrMqwJycZPZ', gravatar_hash = 'd74a97f" | grep -oP "authentication_token = '\K[^']*(?=')"
pYWastSemJrMqwJycZPZ
\K discards the previously matched characters from the final printed match.
[^']* negated character class which matches any character except ', zero or more times.
(?=') Positive lookahead which asserts that the match must be followed by a single quote.
My simple version is
sed -r "s/(.*authentication_token = ')([^']*)(.*)/\2/"
IMO, grep -oP is the best solution. For completeness, a couple of alternatives:
sed 's/.*authentication_token = '\''//; s/'\''.*//' <<<"$string"
awk -F "'" '{for (i=1; i<NF; i+=2) if ($1 ~ /authentication_token = $/) {print $(i+1); break}}' <<< "$string"
Use bash's regular expression matching facilities.
$ regex="_token = '([^']+)'"
$ string="random text before authentication_token = 'pYWastSemJrMqwJycZPZ', gravatar_hash = 'd74a97f'"
$ [[ $string =~ $regex ]] && hash=${BASH_REMATCH[1]}
$ echo "$hash"
pYWastSemJrMqwJycZPZ
Using a variable in place of a literal regular expression simplifies quoting the spaces and single quotes.
I'm looking at files that all have a different version number that starts at column 18 of line 7.
What's the best way with Bash to read (into a $variable) the string on line 7, from column, i.e. "character," 18 to the end of the line? What about to the 5th to last character of the line?
sed way:
variable=$(sed -n '7s/^.\{17\}//p' file)
EDIT (thanks to commenters): If by columns you mean fields (separated with tabs or spaces), the command can be changed to
variable=$(sed -n '7s/^\s*\(\S\+\s\+\)\{17\}//p' file)
You have a number of different ways you can go about this, depending on the utilities you want to use. One of your options is to make use of Bash's substring expansion in any of the following ways:
sed
line=1
string=$(sed -n "${line}p" /etc/passwd)
echo "${string:17}"
awk
line=1
string=$(awk "NR==${line} {print}; {next}" /etc/passwd)
echo "${string:17}"
coreutils
line=1
string=$( { head -n "$line" | tail -n 1; } < /etc/passwd )
echo "${string:17}"
Use
var=$(head -n 7 filename | tail -n 1 | cut -f 18-)
or
var=$(awk 'NR == 7 { delim = ""; for (i = 18; i <= NF; i++) { printf "%s%s", delim, $i; delim = OFS }; printf "\n"; exit }' filename)
If you mean "characters" instead of "fields":
var=$(head -n 7 filename | tail -n 1 | cut -c 18-)
or
var=$(awk 'NR == 7 { print substr($0, 18); exit }' filename)
If by 'columns' you mean 'fields':
a=$( awk 'NR==7{ print $18 }' file )
If you really want the 18th byte through the end of line 7, do:
a=$( sed -n 7p file | cut -b 18- )
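The second part of the question (stopping at the 5th-to-last character) can be handled with another expansion; a sketch in Bash, dropping the last five characters (adjust the number of ?'s if you want the boundary character included):
line=$(sed -n 7p file)
var=${line:17}   # character 18 to end of line
var=${var%?????} # strip the final 5 characters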
Is there an inbuilt command to do this or has anyone had any luck with a script that does it?
I am looking to get counts of how many lines had how many occurrences of a specific character. (sorted descending by the number of occurrences)
For example, with this sample file:
gkdjpgfdpgdp
fdkj
pgdppp
ppp
gfjkl
Suggested input (for the 'p' character)
bash/perl some_script_name "p" samplefile
Desired output:
occs count
4 1
3 2
0 2
Update:
How would you write a solution that works on a 2-character string such as 'gd', rather than just a single character such as 'p'?
$ sed 's/[^p]//g' input.txt | awk '{print length}' | sort -nr | uniq -c | awk 'BEGIN{print "occs", "count"}{print $2,$1}' | column -t
occs count
4 1
3 2
0 2
You could give the desired character as the field separator for awk, and do this:
awk -F 'p' '{ print NF-1 }' |
sort -k1nr |
uniq -c |
awk -v OFS="\t" 'BEGIN { print "occs", "count" } { print $2, $1 }'
For your sample data, it produces:
occs count
4 1
3 2
0 2
If you want to count occurrences of multi-character strings, just give the desired string as the separator, e.g., awk -F 'gd' ... or awk -F 'pp' ....
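Note that this counts non-overlapping occurrences, because awk consumes each separator as it splits:
$ echo pppp | awk -F 'pp' '{ print NF-1 }'
2
The Perl lookahead solution further down finds 3 here.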
#!/usr/bin/env perl
use strict; use warnings;
my $seq = shift @ARGV;
die unless defined $seq;
my %freq;
while ( my $line = <> ) {
last unless $line =~ /\S/;
my $occurrences = () = $line =~ /(\Q$seq\E)/g;
$freq{ $occurrences } += 1;
}
for my $occurrences ( sort { $b <=> $a } keys %freq ) {
print "$occurrences:\t$freq{$occurrences}\n";
}
If you want short, you can always use:
#!/usr/bin/env perl
$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>
;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f;
or, perl -e '$x=shift;/\S/&&++$f{$a=()=/(\Q$x\E)/g}while<>;print"$_:\t$f{$_}\n"for sort{$b<=>$a}keys%f' p inputfile (the first argument is the string to count), but now I am getting silly.
Pure Bash:
declare -a count
while read ; do
cnt=${REPLY//[^p]/} # remove non-p characters
((count[${#cnt}]++)) # use length as array index
done < "$infile"
for idx in ${!count[*]} # iterate over existing indices
do echo -e "$idx ${count[idx]}"
done | sort -nr
Output as desired:
4 1
3 2
0 2
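For the 2-character update ('gd'), the same length trick works if you divide the length difference by the pattern length; a sketch of the changed loop (the printing loop stays the same):
while read -r ; do
cnt=${REPLY//gd/}   # remove every (non-overlapping) 'gd'
((count[ (${#REPLY} - ${#cnt}) / 2 ]++))   # length difference / pattern length
done < "$infile"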
You can do it in one gawk process (well, with a sort coprocess):
gawk -F p -v OFS='\t' '
{ count[NF-1]++ }
END {
print "occs", "count"
coproc = "sort -rn"
for (n in count)
print n, count[n] |& coproc
close(coproc, "to")
while ((coproc |& getline) > 0)
print
close(coproc)
}
'
Shortest solution so far:
perl -nE'say tr/p//' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
For multiple characters, use a regex pattern:
perl -ple'$_ = () = /pg/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
This one handles overlapping matches (e.g. it finds 3 "pp" in "pppp" instead of 2):
perl -ple'$_ = () = /(?=pp)/g' | sort -nr | uniq -c |
awk 'BEGIN{print "occs","count"}{print $2,$1}' |
column -t
Original cryptic but short pure-Perl version:
perl -nE'
++$c{ () = /pg/g };
}{
say "occs\tcount";
say "$_\t$c{$_}" for sort { $b <=> $a } keys %c;
'
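The }{ closes the implicit while-loop that -n wraps around the code and opens a trailing bare block - a poor man's END. The same thing written explicitly:
perl -nE'
++$c{ () = /pg/g };
END {
say "occs\tcount";
say "$_\t$c{$_}" for sort { $b <=> $a } keys %c;
}
'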