Extract part of one column and save into another file using awk - linux

I have a requirement to extract fields from a csv file. There are two columns, billing_info and key_id. billing_info is an object which has multiple data items in curly braces. I need to extract billing_info.id_encrypted and key_id into a different file.
input.csv
billing_info,key_id
{id: '1B82', id_encrypted: '1Q4AW5bwyU', address: 'san jose', phone: '13423', country: 'v73jyqgE='},bf6-96f751
output.csv
billing_info.id_encrypted,key_id
1Q4AW5bwyU,bf6-96f751
How can I use an awk command to extract the data in the format shown in output.csv? Please help.

Making some assumptions:
the first line of input lists the column names
the brace-delimited element contains an arbitrary number of comma-separated key-value pairs
key-value pairs can appear in an arbitrary order
values are delimited by single-quotes
commas cannot appear inside keys or values
single-quotes do not appear anywhere else
<csvfile | awk -F, '
    BEGIN {
        getline
        print "billing_info.id_encrypted,key_id"
    }
    {
        for (i=1; i<NF; i++)
            if ($i ~ /id_encrypted/)
                split($i, e, /\047/)
        print e[2] "," $NF
    }
'
Notes:
-F, splits input lines into comma-separated fields
BEGIN section handles the header
we output the header even if there is no input
for loop runs through all the fields (except the final one)
($i ~ /id_encrypted/) looks for any that contain the key word
split splits that field on single-quotes (/\047/)
print outputs the value found, and the final field
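For reference, a sample run against the input.csv from the question (saving the awk program body as extract.awk is just an assumption for illustration; the -F, option is still required):
$ <input.csv awk -F, -f extract.awk
billing_info.id_encrypted,key_id
1Q4AW5bwyU,bf6-96f751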

Here is a fast and elegant solution using awk:
awk -F ":" '{split($3,arr1,",");split($6,arr2,",");print arr1[1] "," arr2[2]}' input.csv > output.csv
With an explanation:
-F ":" make the awk field separator :
split($3,arr1,",") split the 3rd field by the ,into array having 2 elements.
split($6,arr2,",") split the 6th field by the ,into array having 2 elements.
Then print out the first element in arr1 and the second element in arr2.
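Two caveats on the sample data: the extracted value keeps its leading space and single quotes, and the header line produces a stray , line. A hedged refinement that skips the header and strips those characters (assuming the same field layout and single-quoted values):
awk -F ":" 'NR>1 {split($3,arr1,","); split($6,arr2,","); gsub(/[ \047]/,"",arr1[1]); print arr1[1] "," arr2[2]}' input.csv > output.csv
A BEGIN {print "billing_info.id_encrypted,key_id"} block can be added if the output header is wanted.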

I recommend you just convert your whole input to CSV and THEN you can trivially extract whatever fields you like from it using awk or Excel or any other tool, e.g.:
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
split($0,hdr)
next
}
{
fld[1] = fld[2] = $0
sub(/,[^,]*$/,"",fld[1])
gsub(/^{|}$/,"",fld[1])
sub(/.*,/,"",fld[2])
# print "trace: " hdr[1] "=<" fld[1] ">" | "cat>&2"
# print "trace: " hdr[2] "=<" fld[2] ">" | "cat>&2"
numTags = split(fld[1],tags,/'[^']*'/,vals)
delete tags[numTags--]
for (tagNr=1; tagNr<=numTags; tagNr++) {
gsub(/^, *|: *$/,"",tags[tagNr])
gsub(/^'|'$/,"",vals[tagNr])
# print "trace: " tagNr ": <" tags[tagNr] "=" vals[tagNr] ">" | "cat>&2"
}
}
FNR == 2 {
for (tagNr=1; tagNr<=numTags; tagNr++) {
printf "%s.%s%s", hdr[1], tags[tagNr], OFS
}
print hdr[2]
}
{
for (tagNr=1; tagNr<=numTags; tagNr++) {
printf "\"%s\"%s", vals[tagNr], OFS
}
printf "\"%s\"%s", fld[2], ORS
}
$ awk -f tst.awk file
billing_info.id,billing_info.id_encrypted,billing_info.address,billing_info.phone,billing_info.country,key_id
"1B82","1Q4AW5bwyU","san jose","13423","v73jyqgE=","bf6-96f751"
The above uses GNU awk for the 4th arg to split(). Uncomment the print trace lines to see what each step is doing if you like. You don't need to add the double quotes around each output field if you remove or replace any commas within each field (esp. the address).
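If the 4-arg split() is unfamiliar, here is a minimal illustration (gawk only; the sample string is made up): the 4th array receives the text that matched the separator regexp between the fields.
echo "id: '1B82', phone: '13423'" | gawk '{
    n = split($0, tags, /\047[^\047]*\047/, seps)   # seps[i] is the i-th quoted match
    for (i = 1; i < n; i++) print tags[i] "->" seps[i]
}'
which prints:
id: ->'1B82'
, phone: ->'13423'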

Related

Remove non matching values in csv column

I need to validate and clean a field in a CSV. There is a column for IP addresses and I need to remove only invalid data inside that column.
I tried the following command :
awk 'BEGIN{ FS=OFS="," }{ gsub(/^([0-9]{1,3}[\.]){3}[0-9]{1,3}$/,"", $3) }1' input.csv
Input file
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3 male,usa
Current output
anna,new york,,usa
james,denver,,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3 male,usa
Expected output
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa
This command removes the matching data, but I need the opposite. How do I remove only the non-matching data in the IP column?
How do I remove only the non-matching data in the IP column?
You might combine the following string functions, match and substr, for this task. Given file.txt containing:
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3 male,usa
then
awk 'BEGIN{FS=OFS=","}{$3=match($3,/([0-9]{1,3}[\.]){3}[0-9]{1,3}/)?substr($3,RSTART,RLENGTH):"";print}' file.txt
gives output
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa
Explanation: I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). Then, for each line, I use the so-called ternary operator, condition ? value_if_true : value_if_false, where the condition is whether $3 matches the regular expression. Observe that I altered the regex slightly, so it holds if an IP is somewhere inside the field rather than spanning the whole column. If a match is found, I use substr to get the substring corresponding to the match via RSTART and RLENGTH, which were set by match; otherwise I use an empty string. After that I print the whole line.
(tested in gawk 4.2.1)
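If match() with RSTART and RLENGTH is new to you, a minimal demo (the input string is made up):
$ echo '10.2.8.3 male' | awk '{ if (match($0, /([0-9]{1,3}\.){3}[0-9]{1,3}/)) print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }'
1 8 10.2.8.3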
If your CSV is as simple as what you show (one line per record, no commas inside fields, no quoted fields, no leading or trailing spaces in fields...), and after removing the male in 10.2.8.3 male (is it a typo?), you could try:
$ awk -F, -v OFS=, '$3 !~ /^([0-9]{1,3}\.){3}[0-9]{1,3}$/ {$3 = ""} {print}' input.csv
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa
And if you want to check that the 3rd field is really a valid full IP address (no subnets):
$ cat filter.awk
function isIP(v) {
    if (v !~ /^([0-9]{1,3}\.){3}[0-9]{1,3}$/)
        return 0;
    split(v, a, /\./)
    for (i = 1; i <= 4; i++) {
        if (a[i] > 255) {
            return 0;
        }
    }
    return 1
}
BEGIN { FS = ","; OFS = "," }
! isIP($3) { $3 = "" }
{ print }
$ cat input.csv
bob,LA,292.168.1.5,usa
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3,usa
$ awk -f filter.awk input.csv
bob,LA,,usa
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa

match between two files and merge the output using awk

I have two files. The first column is common between both files, and I would like to merge them, generating output that copies the first file's second column onto each line of the second file whenever there is a match.
file1
412234;mark
413234;raja
file2
412234;value1
412234;value2
412234;value3
412234;value4
413234;value1
413234;value2
413234;value3
Output file
412234;value1;mark
412234;value2;mark
412234;value3;mark
412234;value4;mark
413234;value1;raja
413234;value2;raja
413234;value3;raja
Try this:
awk -F';' 'BEGIN{FS=OFS=";"} FNR==NR{a[$1]=$2; next} ($1 in a){print $1, $2, a[$1]}' file1 file2
Explanation:
-F';' means that AWK will use ; as the field separator;
BEGIN{FS=OFS=";"} sets the output field separator, used by the print function;
AWK parses all files sequentially; the condition:
FNR==NR
is true only when parsing the first file.
While parsing file1, it saves an array a with the first field as index and the second field as value;
a is expected to be
a[412234] = mark
a[413234] = raja
($1 in a) is the condition to be met, true when the first field of a file2 line is found as an index of array a.
If true then execute:
print $1, $2, a[$1]
which, with OFS=";", prints the matching line from file2 followed by the value of array a saved from file1. With the sample file1 and file2 this reproduces the requested output exactly.
----- EDIT
In case file1 contains multiple lines with the same index, you need to save all the values and then scan the whole vector for multiple matches on file2:
awk -F';' '
    # helper: number of elements in array a
    function vlen(a) { n=0; for (i in a) n++; return n }
    # helper: is val among the values of vect?
    function contained(val, vect) { found=0; for (x in vect) if (vect[x] == val) found=1; return found }
    BEGIN { FS=OFS=";" }                             # set output field separator
    FNR==NR { n=vlen(a); a[n]=$1; b[n]=$2; next }    # scan file1, saving all indexes and values in parallel arrays
    # for each line of file2, scan the whole array a looking for matches
    {
        if (contained($1, a)) {
            for (i in a) if (a[i] == $1) print $1, $2, b[i]
        } else {
            print $1, $2
        }
    }
' file1 file2
Here we define the vlen and contained helper functions.
Would you try the following:
awk '
BEGIN { FS=OFS=";" }
NR==FNR {
    c[$1]++
    a[$1,c[$1]]=$2
    next
}
{
    if (c[$1]) {
        for (i=1; i<=c[$1]; i++) {
            $3=a[$1,i]; print
        }
    } else {
        print
    }
}' file1 file2
Result with the file1 and file2 provided in the OP's last comment:
412234;value1;mark
412234;value1;raja
412234;value2;mark
412234;value2;raja
413234;value1
413234;value2
If the index in the 1st column (such as 412234) appears more than once
in file1, we need to preserve the existing value in the 2nd column
(such as mark) without overwriting.
Then an array c is introduced to count the occurrences of the index.
Note that the order of the result differs from the OP's expected output.
I hope it is acceptable.
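If a specific order is required, one option is to pipe the result through sort; a sketch (merge.awk is an assumed name for the script above):
$ awk -f merge.awk file1 file2 | sort -t';' -k1,1n -k2,2
Here -k1,1n sorts the index numerically and -k2,2 orders the values within each index.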

Filtering CSV file based on string name

I'm trying to get specific columns of a CSV file (those whose header contains "SOF" in this case). It is a large file and I need to copy these columns to another CSV file using shell.
I've tried something like this:
#!/bin/bash
awk ' {
i=1
j=1
while ( NR==1 )
if ( "$i" ~ /SOF/ )
then
array[j] = $i
$j += 1
fi
$i += 1
for ( k in array )
print array[k]
}' fil1.csv > result.csv
In this case I've tried to save the column numbers whose header contains "SOF" in an array, and after that copy the columns using these numbers.
Preliminary note: contrary to what one may infer from the code included in the OP, the values in the CSV are delimited with a semicolon.
Here is a solution with two separate commands:
the first parses the first line of your CSV file and identifies which fields must be exported. I use awk for this.
the second only prints the fields. I use cut for this (simpler syntax and quicker than awk, especially if your file is large)
The idea is that the first command yields a list of field numbers, separated with ",", suited to be passed as parameter to cut:
# Command #1: identify fields
fields=$(awk -F";" '
    {
        for (i = 1; i <= NF; i++)
            if ($i ~ /SOF/) {
                fields = fields sep i
                sep = ","
            }
        print fields
        exit
    }' fil1.csv
)
# Command #2: export fields
{ [ -n "$fields" ] && cut -d";" -f "$fields" fil1.csv; } > result.csv
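To illustrate what command #1 produces, with a made-up header line:
$ echo 'NAME;SOF_A;CITY;SOF_B' | awk -F";" '{for (i = 1; i <= NF; i++) if ($i ~ /SOF/) {fields = fields sep i; sep = ","}; print fields; exit}'
2,4
cut -d";" -f "2,4" then exports exactly those two columns.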
try something like this...
$ awk 'BEGIN {FS=OFS=","}
NR==1 {for(i=1;i<=NF;i++) if($i~/SOF/) {col=i; break}}
{print $col}' file
There is no handling for the case where the sought-out header doesn't exist: col then stays unset (0), and since $0 is the whole record, $col prints the whole line. Also note it keeps only the first matching column; a sketch that keeps all of them follows.
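Since the question asks for columns in the plural, a hedged variant that collects every matching column number and prints them all (same comma-separated assumption as above):
$ awk 'BEGIN {FS=OFS=","}
  NR==1 {for (i=1; i<=NF; i++) if ($i~/SOF/) cols[++n]=i}
  {for (j=1; j<=n; j++) printf "%s%s", $(cols[j]), (j<n ? OFS : ORS)}' file
If no header matches, nothing is printed at all.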
One of the useful commands you probably need is cut:
cut -d , -f 2 input.csv
Here 2 is the number of the column you want to cut from your CSV file.
Try this one out:
awk '{for(i=1;i<=NF;i++)a[i]=a[i]" "$i}END{for (i in a ){ print a[i] } }' filename | grep SOF | awk '{for(i=1;i<=NF;i++)a[i]=a[i]" "$i}END{for (i in a ){ print a[i] } }'
This transposes the file (columns become rows), keeps only the rows that contain SOF, then transposes back. Note that it splits on whitespace by default and joins with spaces, so for a comma- or semicolon-separated file you would need to set the separators accordingly.

How to Compare CSV Column using awk?

I receive a CSV like this:
column$1,column$2,column$3
john,P,10
john,P,10
john,A,20
john,T,30
john,T,10
marc,P,10
marc,C,10
marc,C,20
marc,T,30
marc,A,10
I need to sum the values and display the name and results, but column$2 needs to show the sum of the T values separately from the sum of the P, A and C values.
Output should be this:
column$1,column$2,column$3,column$4
john,PCA,40
john,T,40,CORRECT
marc,PCA,50
marc,T,30,INCORRECT
All I could do was extract the columns I need from the original csv:
awk -F "|" '{print $8 "|" $9 "|" $4}' input.csv >> output.csv
Also sort by the correct column:
sort -t "|" -k1 input.csv >> output.csv
And add a new column to the end of the csv:
awk -F, '{NF=2}1' OFS="|" input.csv >> output.csv
I managed to sum and display the sum by column$1 and $2, but I don't know how to group the different values from column$2:
awk -F "," '{col[$1,$2]++} END {for(i in col) print i, col[i]}' file > output
Awk is stream-oriented: it processes input and outputs what you change; it does not make in-file changes.
You just need to add a corresponding print:
awk '{if($2 == "T") {print "MATCHED"}}'
If you want to output more than "MATCHED", you need to add it to the print,
e.g. '{print $1 "|" $2 "|" $3 "|" " MATCHED"}'
or use print $0, as a comment mentions above.
Assuming that "CORRECT" and "INCORRECT" are determined by comparing the "PCA" value to the "T" value, the following awk script should do the trick:
awk -F, -vOFS=, '$2=="T"{t[$1]+=$3;n[$1]} $2!="T"{s[$1]+=$3;n[$1]} END{ for(i in n){print i,"PCA",s[i]; print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")} }' inputfile
Broken out for easier reading, here's what this looks like:
awk -F, -vOFS=, '
    $2=="T" {       # match all records that are "T"
        t[$1]+=$3   # add the value for this record to an array of totals
        n[$1]       # record this name in our authoritative name list
    }
    $2!="T" {       # match all records that are NOT "T"
        s[$1]+=$3   # add the value for this record to an array of sums
        n[$1]       # record this name too
    }
    END {           # Now that we have collected data, analyse the results
        for (i in n) {   # step through our authoritative list of names
            print i,"PCA",s[i]
            print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")
        }
    }
' inputfile
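Run against the sample input, this yields the requested result:
john,PCA,40
john,T,40,CORRECT
marc,PCA,50
marc,T,30,INCORRECT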
Note that array order is not guaranteed in awk, so your output may not come out in the same order as your input.
If you want your output to be delimited using vertical bars, change the -vOFS=, to -vOFS='|'.
Then you can sort using:
awk ... | sort
which by default sorts on the whole line, i.e. starting at the first field.

Check variables from different lines with awk

I want to combine values from multiple lines of different lengths into one line using awk, when they match. In the following sample, lines match on the value of the first field, and the values of the second field are aggregated into a list.
Input, sample csv:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z
How can I write an awk expression (or maybe some other shell expression) to check whether the first field's value matches that of the next/previous line, and then print a list of the second field's values, aggregated and separated by a pipe?
awk '
BEGIN { FS=";" }
{
    if ($1 == prev) { sec = sec "|" $2 }
    else {
        if (prev) { print prev ";" sec }
        prev = $1; sec = $2
    }
}
END { if (prev) { print prev ";" sec } }'
This, as you requested, checks the consecutive lines.
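For reference, a sample run (consecutive.awk is an assumed name for the program body above; it relies on lines with the same key being adjacent, as in the sample):
$ <input.csv awk -f consecutive.awk
222;a|b
555;f
4444;a|d|z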
Does this one-liner work?
awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' file
tested here:
kent$ cat a
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
kent$ awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' a
555;f
4444;a|d|z
222;a|b
If you want to keep it sorted, add a | sort at the end (plain sort orders the keys lexically; use sort -t';' -k1,1n for numeric order).
Slightly convoluted, but does the job:
awk -F';' '
{
    if (a[$1]) {
        a[$1] = a[$1] "|" $2
    } else {
        a[$1] = $2
    }
}
END {
    for (k in a) {
        print k ";" a[k]
    }
}' file
Assuming that you have set the field separator ( -F ) to ; :
{
if ( $1 != last ) { print s; s = ""; }
last = $1;
s = s "|" $2;
} END {
print s;
}
The first line and the first character are slightly wrong, but that's an exercise for the reader :-). Two simple if's suffice to fix that.
(Edit: Missed out last line.)
this should work:
Command:
awk -F';' '{if(a[$1]){a[$1]=a[$1]"|"$2}else{a[$1]=$2}}END{for (i in a){print i";" a[i] }}' fil
Input:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z
