Related
I need to validate and clean a field in CSV. There is column for IP address and I need to remove only invalid data inside that column.
I tried the following command :
awk 'BEGIN{ FS=OFS="," }{ gsub(/^([0-9]{1,3}[\.]){3}[0-9]{1,3}$/,"", $3) }1' input.csv
Input file
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3 male,usa
Current output
anna,new york,,usa
james,denver,,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3 male,usa
Expected output
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa
This command remove the matching data, but I need the opposite. How do I remove only the non-matching data in the IP column ?
How do I remove only the non-matching data in the IP column ?
You might combine following string functions: match substr for this task following way
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3 male,usa
then
awk 'BEGIN{FS=OFS=","}{$3=match($3,/([0-9]{1,3}[\.]){3}[0-9]{1,3}/)?substr($3,RSTART,RLENGTH):"";print}' file.txt
gives output
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa
Explanation: I inform GNU AWK that , is both field separator (FS) and output field separator (OFS), then for each line I use so called ternary operator condition?valueiftrue:valueiffalse, condition is if $3 does match regular expression, observe that I altered it slightly, so it does hold if IP is somewhere inside, rather than span whole column. If match found I use substr to get substring which does correspond to match using RSTART, RLENGTH which were set by match, otherwise I use empty string. After that I print whole line.
(tested in gawk 4.2.1)
If your CSV is as simple as what you show (one line per record, no commas inside fields, no quoted fields, no leading or trailing spaces in fields...), and after removing the male in 10.2.8.3 male (is it a typo?), you could try:
$ awk -F, -v OFS=, '$3 !~ /^([0-9]{1,3}\.){3}[0-9]{1,3}$/ {$3 = ""} {print}' input.csv
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa
And if you want to check that the 3rd field is really a valid full IP address (no subnets):
$ cat filter.awk
function isIP(v) {
if(v !~ /^([0-9]{1,3}\.){3}[0-9]{1,3}$/)
return 0;
split(v, a, /\./)
for(i = 1; i <= 4 ; i++) {
if(a[i] > 255) {
return 0;
}
}
return 1
}
BEGIN { FS = ","; OFS = "," }
! isIP($3) {$3 = ""}
{print}
$ cat input.csv
bob,LA,292.168.1.5,usa
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3,usa
$ awk -f filter.awk input.csv
bob,LA,,usa
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa
I have a requirement to extract fields from a csv file. There are two columns billing_info and key_id. billing_info is a object which has multiple data items in curly braces. I need to extract billing_info.id_encrypted, key_id into a different file.
input.csv
billing_info,key_id
{id: '1B82', id_encrypted: '1Q4AW5bwyU', address: 'san jose', phone: '13423', country: 'v73jyqgE='},bf6-96f751
output.csv
billing_info.id_encrypted,key_id
1Q4AW5bwyU,bf6-96f751
May i know how to use awk command to extract the data in format mentioned in output.csv. Please help
Making some assumptions:
the first line of input lists the column names
the brace-delimited element contains an arbitrary number of comma-separated key-value pairs
key-value pairs can appear in an arbitrary order
values are delimited by single-quotes
commas cannot appear inside keys or values
single-quotes do not appear anywhere else
<csvfile | awk -F, '
BEGIN {
getline
print "billing_info.id_encrypted,key_id"
}
{
for (i=1; i<NF; i++)
if ($i ~ /id_encrypted/)
split($i, e, /\047/)
print e[2] "," $NF
}
'
Notes:
-F, splits input lines into comma-separated fields
BEGIN section handles the header
we output the header even if there is no input
for loop runs through all the fields (except the final one)
($i ~ /id_encrypted/) looks for any that contain the key word
split splits that field on single-quotes (/\047/)
print outputs the value found, and the final field
Here is a fast and elegant solution using awk:
awk -F ":" '{split($3,arr1,",");split($6,arr2,",");print arr1[1] "," arr2[2]}' input.csv > output.csv
With an explanation:
-F ":" make the awk field separator :
split($3,arr1,",") split the 3rd field by the ,into array having 2 elements.
split($6,arr2,",") split the 6th field by the ,into array having 2 elements.
Then print out the first element in arr1 and the second element in arr2.
I recommend you just convert your whole input to CSV and THEN you can trivially extract whatever fields you like from it using awk or Excel or any other tool, e.g.:
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
split($0,hdr)
next
}
{
fld[1] = fld[2] = $0
sub(/,[^,]*$/,"",fld[1])
gsub(/^{|}$/,"",fld[1])
sub(/.*,/,"",fld[2])
# print "trace: " hdr[1] "=<" fld[1] ">" | "cat>&2"
# print "trace: " hdr[2] "=<" fld[2] ">" | "cat>&2"
numTags = split(fld[1],tags,/'[^']*'/,vals)
delete tags[numTags--]
for (tagNr=1; tagNr<=numTags; tagNr++) {
gsub(/^, *|: *$/,"",tags[tagNr])
gsub(/^'|'$/,"",vals[tagNr])
# print "trace: " tagNr ": <" tags[tagNr] "=" vals[tagNr] ">" | "cat>&2"
}
}
FNR == 2 {
for (tagNr=1; tagNr<=numTags; tagNr++) {
printf "%s.%s%s", hdr[1], tags[tagNr], OFS
}
print hdr[2]
}
{
for (tagNr=1; tagNr<=numTags; tagNr++) {
printf "\"%s\"%s", vals[tagNr], OFS
}
printf "\"%s\"%s", fld[2], ORS
}
.
$ awk -f tst.awk file
billing_info.id,billing_info.id_encrypted,billing_info.address,billing_info.phone,billing_info.country,key_id
"1B82","1Q4AW5bwyU","san jose","13423","v73jyqgE=","bf6-96f751"
The above uses GNU awk for the 4th arg to split(). Uncomment the print trace lines to see what each step is doing if you like. You don't need to add the double quotes around each output field if you remove or replace any commas within each field (esp. the address).
I have 2 files. Basically i want to match the column names from File 1 with the column name listed in the File 2. The resulting output File should have data for the column that matches with File 2 and Null value for the remaining column name in File 2.
Example:
file1
Name|Phone_Number|Location|Email
Jim|032131|xyz|xyz#qqq.com
Tim|037903|zzz|zzz#qqq.com
Pim|039141|xxz|xxz#qqq.com
File2
Location
Name
Age
Based on these 2 files, I want to create new file which has data in the below format:
Output:
Location|Name|Age
xyz|Jim|Null
zzz|Tim|Null
xxz|Pim|Null
Is there a way to get this result using join, awk or sed. I tried with join but couldnt get it working.
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==FNR { names[++numNames] = $0; next }
FNR==1 {
for (nameNr=1;nameNr<=numNames;nameNr++) {
name = names[nameNr]
printf "%s%s", name, (nameNr<numNames?OFS:ORS)
}
for (i=1;i<=NF;i++) {
name2fldNr[$i] = i
}
next
}
{
for (nameNr=1;nameNr<=numNames;nameNr++) {
name = names[nameNr]
fldNr = name2fldNr[name]
printf "%s%s", (fldNr?$fldNr:"Null"), (nameNr<numNames?OFS:ORS)
}
}
$ awk -f tst.awk file2 file1
Location|Name|Age
xyz|Jim|Null
zzz|Tim|Null
xxz|Pim|Null
Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
I'd suggest using csvcut, which is part of CSVKit (https://csvkit.readthedocs.org), along the lines of the following:
#!/bin/bash
HEADERS=File2
PSV=File1
headers=$(tr '\n' , < "$HEADERS" | sed 's/,$//' )
awk '-F|' '
BEGIN {OFS=FS}
NR==1 {print $0,"Age"; next}
{print $0, "Null"}' "$PSV" ) |\
csvcut "-d|" -c "$headers"
I realize this may not be entirely satisfactory, but csvcut doesn't currently have options to handle missing columns or translate missing data to a specified value.
If I have 3 csv files, and I want to merge the data all into one, but beside each other, how would I do it? For example:
Initial Merged file:
,,,,,,,,,,,,
File 1:
20,09/05,5694
20,09/06,3234
20,09/08,2342
File 2:
20,09/05,2341
20,09/06,2334
20,09/09,342
File 3:
20,09/05,1231
20,09/08,3452
20,09/10,2345
20,09/11,372
Final merged File:
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,,,09/11,372
Basically data from each file goes into a specific column of the merged file.
I know the awk function can be used for this, but I have no clue how to start
EDIT: Only the 2nd and 3rd Columns of each file are being printed. I was using this to print out the 2nd and 3rd columns:
awk -v f="${i}" -F, 'match ($0,f) { print $2","$3 }' file3.csv > d$i.csv
however, say for example, file1 and file2 were null in that row, the data for that row would be shifted to the left. so I came up with this to account for the shift:
awk -v x="${i}" -F, 'match ($0,x) { if ($2='/NULL') { print "," }; else { print $2","$3}; }' alld.csv > d$i.csv
Using GNU awk for ARGIND:
$ gawk '{ a[FNR,ARGIND]=$0; maxFnr=(FNR>maxFnr?FNR:maxFnr) }
END {
for (i=1;i<=maxFnr;i++) {
for (j=1;j<ARGC;j++)
printf "%s%s", (j==1?"":",,,"), (a[i,j]?a[i,j]:",")
print ""
}
}
' file1 file2 file3
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,,,09/11,372
If you don't have GNU awk, just add an initial line that says FNR==1{ARGIND++}.
Commented version per request:
$ gawk '
{ a[FNR,ARGIND]=$0; # Store the current line in a 2-D array `a` indexed by
# the current line number `FNR` and file number `ARGIND`.
maxFnr=(FNR>maxFnr?FNR:maxFnr) # save the max FNR value
}
END{
for (i=1;i<=maxFnr;i++) { # Loop from 1 to max number of fields
# seen across all files and for each:
for (j=1;j<ARGC;j++) # Loop from 1 to total number of files parsed and:
printf "%s%s", # Print 2 strings, specifically:
(j==1?"":",,,"), # A field separator - empty if were printing
# the first field, three commas otherwise.
(a[i,j]?a[i,j]:",") # The value stored in the array if it was
# present in the files, a comma otherwise.
print "" # Print a newline
}
}
' file1 file2 file3
I originally was using an array fnr[FNR] to track the max value of FNR but IMHO that's kinda obscure and it has a flaw where if no lines have, say, a 2nd field then a loop on for (i=1;i in fnr;i++) in the END section would bail out before getting to the 3rd field.
paste is done for this:
$ paste -d";" f1 f2 f3 | sed 's/;/,,,/g'
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
Note that the paste alone will output just one comma:
$ paste -d, f1 f2 f3
09/05,5694,09/05,2341,09/05,1231
09/06,3234,09/06,2334,09/08,3452
09/08,2342,09/09,342,09/10,2345
,,09/11,372
So to have multiple ones we can use another delimiter like ; and then replace by ,,, with sed:
$ paste -d";" f1 f2 f3 | sed 's/;/,,,/g'
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
Using pr:
$ pr -mts',,,' file[1-3]
09/05,5694,,,09/05,2341,,,09/05,1231
09/06,3234,,,09/06,2334,,,09/08,3452
09/08,2342,,,09/09,342,,,09/10,2345
,,,,,,09/11,372
I want to combine values from multiple lines with different lengths using awk into one line if they match. In the following sample match values for first field,
aggregating values from second field into a list.
Input, sample csv:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z
How can I write an awk expression (maybe some other shell expression) to check if the first field value match with the next/previous line, and then print a list of second fields values aggregated and separated by a pipe?
awk '
BEGIN {FS=";"}
{ if ($1==prev) {sec=sec "|" $2; }
else { if (prev) { print prev ";" sec; };
prev=$1; sec=$2; }}
END { if (prev) { print prev ";" sec; }}'
This, as you requested, checks the consecutive lines.
does this oneliner work?
awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' file
tested here:
kent$ cat a
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
kent$ awk -F';' '{a[$1]=a[$1]?a[$1]"|"$2:$2;} END{for(x in a) print x";"a[x]}' a
555;f
4444;a|d|z
222;a|b
if you want to keep it sorted, add a |sort at the end.
Slightly convoluted, but does the job:
awk -F';' \
'{
if (a[$1]) {
a[$1]=a[$1] "|" $2
} else {
a[$1]=$2
}
}
END {
for (k in a) {
print k ";" a[k]
}
}' file
Assuming that you have set the field separator ( -F ) to ; :
{
if ( $1 != last ) { print s; s = ""; }
last = $1;
s = s "|" $2;
} END {
print s;
}
The first line and the first character are slightly wrong, but that's an exercise for the reader :-). Two simple if's suffice to fix that.
(Edit: Missed out last line.)
this should work:
Command:
awk -F';' '{if(a[$1]){a[$1]=a[$1]"|"$2}else{a[$1]=$2}}END{for (i in a){print i";" a[i] }}' fil
Input:
222;a;DB;a
222;b;DB;a
555;f;DB;a
4444;a;DB;a
4444;d;DB;a
4444;z;DB;a
Output:
222;a|b
555;f
4444;a|d|z