I need to validate and clean a field in a CSV file. There is a column for IP addresses, and I need to remove only the invalid data in that column.
I tried the following command:
awk 'BEGIN{ FS=OFS="," }{ gsub(/^([0-9]{1,3}[\.]){3}[0-9]{1,3}$/,"", $3) }1' input.csv
Input file
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3 male,usa
Current output
anna,new york,,usa
james,denver,,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3 male,usa
Expected output
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa
This command removes the matching data, but I need the opposite. How do I remove only the non-matching data in the IP column?
You might combine the string functions match and substr for this task in the following way. If file.txt contains
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3 male,usa
then
awk 'BEGIN{FS=OFS=","}{$3=match($3,/([0-9]{1,3}[\.]){3}[0-9]{1,3}/)?substr($3,RSTART,RLENGTH):"";print}' file.txt
gives output
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa
Explanation: I tell GNU AWK that , is both the field separator (FS) and the output field separator (OFS). Then, for each line, I use the so-called ternary operator, condition?valueiftrue:valueiffalse. The condition is whether $3 matches the regular expression; note that I altered it slightly (dropped the ^ and $ anchors) so that it matches an IP anywhere inside the column rather than requiring it to span the whole column. If a match is found, I use substr to extract the matching substring using RSTART and RLENGTH, which are set by match; otherwise I use an empty string. After that I print the whole line.
(tested in gawk 4.2.1)
If your CSV is as simple as what you show (one line per record, no commas inside fields, no quoted fields, no leading or trailing spaces in fields...), and after removing the male in 10.2.8.3 male (is it a typo?), you could try:
$ awk -F, -v OFS=, '$3 !~ /^([0-9]{1,3}\.){3}[0-9]{1,3}$/ {$3 = ""} {print}' input.csv
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa
And if you want to check that the 3rd field is really a valid full IP address (no subnets):
$ cat filter.awk
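function isIP(v,    a, i) {   # a and i are local variables
    # must look like four dot-separated groups of 1 to 3 digits
    if (v !~ /^([0-9]{1,3}\.){3}[0-9]{1,3}$/)
        return 0
    split(v, a, /\./)
    # every octet must be in the range 0..255
    for (i = 1; i <= 4; i++) {
        if (a[i] > 255) {
            return 0
        }
    }
    return 1
}
BEGIN { FS = ","; OFS = "," }
! isIP($3) { $3 = "" }
{ print }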
$ cat input.csv
bob,LA,292.168.1.5,usa
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,colarado,usa
tommy,new york,10.2.8.3,usa
$ awk -f filter.awk input.csv
bob,LA,,usa
anna,new york,192.168.1.5,usa
james,denver,240.210.1.8,usa
peter,denver,,usa
tommy,new york,10.2.8.3,usa
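The two answers above can be combined: extract an IP that is embedded in the field (as with match/substr) and then reject it if any octet exceeds 255. The following is only a sketch under some assumptions: clean_ip is a hypothetical helper name, the sample rows are taken from the inputs shown above, and the regex uses [0-9]+ instead of {1,3} so that it also runs on awk builds without interval-expression support (over-long octets are then caught by the range check).

```shell
# Sample rows written inline so the sketch is self-contained.
input='bob,LA,292.168.1.5,usa
tommy,new york,10.2.8.3 male,usa
peter,denver,colarado,usa'

result=$(printf '%s\n' "$input" | awk '
# clean_ip (hypothetical helper): returns the first dotted quad found in v
# if every octet is <= 255, otherwise the empty string.
function clean_ip(v,    ip, octet, i) {
    if (!match(v, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/))
        return ""
    ip = substr(v, RSTART, RLENGTH)
    split(ip, octet, "[.]")
    for (i = 1; i <= 4; i++)
        if (octet[i] + 0 > 255)
            return ""
    return ip
}
BEGIN { FS = OFS = "," }
{ $3 = clean_ip($3); print }
')
printf '%s\n' "$result"
```

With these three rows, the third field is emptied for bob and peter and reduced to 10.2.8.3 for tommy.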
I receive a CSV like this:
column$1,column$2,column$
john,P,10
john,P,10
john,A,20
john,T,30
john,T,10
marc,P,10
marc,C,10
marc,C,20
marc,T,30
marc,A,10
I need to sum the values and display the name and the results, but column$2 needs to show the sum of the T values separately from the sum of the P, A and C values.
Output should be this:
column$1,column$2,column$3,column$4
john,PCA,40
john,T,40,CORRECT
marc,PCA,50
marc,T,30,INCORRECT
All I could do was extract the columns I need from the original CSV:
awk -F "|" '{print $8 "|" $9 "|" $4}' input.csv >> output.csv
Also sort by the correct column:
sort -t "|" -k1 input.csv >> output.csv
And add a new column to the end of the csv:
awk -F, '{NF=2}1' OFS="|" input.csv >> output.csv
I managed to sum and display the sum by column$1 and column$2, but I don't know how to group different values from column$2:
awk -F "," '{col[$1,$2]++} END {for(i in col) print i, col[i]}' file > output
Awk is stream oriented: it processes its input and writes the result to its output. It does not edit files in place.
You just need to add a corresponding print:
awk '{if($2 == "T") {print "MATCHED"}}'
If you want to output more than "MATCHED", you need to add it to the print,
e.g. '{print $1 "|" $2 "|" $3 "|" " MATCHED"}'
or use print $0, as the comment above mentions.
Assuming that "CORRECT" and "INCORRECT" are determined by comparing the "PCA" value to the "T" value, the following awk script should do the trick:
awk -F, -vOFS=, '$2=="T"{t[$1]+=$3;n[$1]} $2!="T"{s[$1]+=$3;n[$1]} END{ for(i in n){print i,"PCA",s[i]; print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")} }' inputfile
Broken out for easier reading, here's what this looks like:
awk -F, -vOFS=, '
  $2=="T" {        # match all records that are "T"
    t[$1]+=$3      # add the value for this record to an array of totals
    n[$1]          # record this name in our authoritative name list
  }
  $2!="T" {        # match all records that are NOT "T"
    s[$1]+=$3      # add the value for this record to an array of sums
    n[$1]          # record this name too
  }
  END {            # Now that we've collected data, analyse the results
    for (i in n) { # step through our authoritative list of names
      print i,"PCA",s[i]
      print i,"T",t[i],(t[i]==s[i] ? "CORRECT" : "INCORRECT")
    }
  }
' inputfile
Note that array order is not guaranteed in awk, so your output may not come out in the same order as your input.
If you want your output to be delimited using vertical bars, change the -vOFS=, to -vOFS='|'.
Then you can sort using:
awk ... | sort
which defaults to -k1.
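If the output should keep the order in which names first appear in the input, rather than relying on sort, one option is to record each new name in a numerically indexed array. This is a sketch, not the answer's original code: the array names seen and order and the counter count are illustrative, and the sample data is the question's input with the header line left out.

```shell
# Question's sample rows, header line omitted (assumption for this sketch).
input='john,P,10
john,P,10
john,A,20
john,T,30
john,T,10
marc,P,10
marc,C,10
marc,C,20
marc,T,30
marc,A,10'

result=$(printf '%s\n' "$input" | awk -F, -v OFS=, '
!($1 in seen) { seen[$1]; order[++count] = $1 }   # remember first-seen order
$2 == "T"     { t[$1] += $3 }                     # running total of T values
$2 != "T"     { s[$1] += $3 }                     # running total of P, C, A values
END {
    for (k = 1; k <= count; k++) {                # numeric loop keeps input order
        name = order[k]
        print name, "PCA", s[name] + 0
        print name, "T", t[name] + 0, (t[name] == s[name] ? "CORRECT" : "INCORRECT")
    }
}')
printf '%s\n' "$result"
```

Looping over 1..count instead of using for (i in n) is what makes the order deterministic.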
For example, I have a line like this in a log file, where the words start after a !:
0 1 ! abs tHfih(t) qcds bbc(u)
With the code below I can find that line:
awk '
/[Tt][Hh][Ff]/ { if ($3 ~ /!/) {print "a"; exit 0}}
' file
How can I tell awk to print the whole line, and the complete word that contains thf (here "tHfih(t)")?
Print the line:
awk '
/[Tt][Hh][Ff]/ { if ($3 ~ /!/) {print "the line containing the match"; exit 0}}
' file
Print the word:
awk '
/[Tt][Hh][Ff]/ { if ($3 ~ /!/) {print "the word containing the match"; exit 0}}
' file
This might be simpler:
awk 'tolower($0) ~ /thf/ && $3=="!"'
UPDATE
If you don't know the position of the field you are searching for, you can scan all fields for a match. For example, for lines that have ! in the third position, print the line number and the word that contains thf (case-insensitively):
awk '$3=="!"{for(i=1;i<=NF;i++) if(tolower($i)~/thf/) print NR, $i}'
UPDATE 2
If you want to switch between printing the matching word and the whole line:
awk -vw=1 '$3=="!"{for(i=1;i<=NF;i++) if(tolower($i)~/thf/) print w?$i:$0}' file
Set w=0 for full-line printing and w=1 for word printing. Note that this assumes a single match per line; otherwise it will print all matching words (or the line that many times in line mode).
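To print both the whole line and the matching word, and to avoid repeated output when several words match, the field loop can stop at the first hit. A minimal sketch using the sample line from the question; the "line:" and "word:" labels are only illustrative.

```shell
line='0 1 ! abs tHfih(t) qcds bbc(u)'

result=$(printf '%s\n' "$line" | awk '
$3 == "!" {
    for (i = 1; i <= NF; i++)
        if (tolower($i) ~ /thf/) {
            print "line: " $0
            print "word: " $i
            break            # report only the first matching word per line
        }
}')
printf '%s\n' "$result"
```

Removing the break would print the line and word once for every matching field.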
Here is my problem
I have a File 1 where I have some data
Var1.1 Var1.2 Var1.3
Var2.1 Var2.2 Var2.3
Var3.1 Var3.2 Var3.3
And I have a File 2 that I would like to edit using the above data:
File2 (1)
***pattern with Var2.1***
some text...
File2(2)
***pattern with Var2.1***
Here I want to add Var2.2 and Var2.3
some text
My first idea is to use AWK, but I don't know how to call a bash command from within it. The AWK script should do something like this:
Search for the pattern in File2.
When awk finds it, call a script that returns the wanted values from File1.
Then awk can edit File2.
Don't hesitate to suggest other, simpler approaches if there are any!
Thank you !
This is how I run an external command from within awk to base64-decode a string (note that <<< is a bash here-string, so this assumes the shell that awk invokes understands it):
cmd = "/usr/bin/base64 -i -d <<< " $2 " 2>/dev/null"
while ( ( cmd | getline result ) > 0 ) { }
close(cmd)
split(result, a, "[:=,]")
name=a[2]
Perhaps you can get some inspiration from it...
There's no need to run an external script to accomplish what you want. It can be done completely within a short AWK script.
awk 'FNR == NR {arr[$1] = $2 " " $3; next} {print; for (lookup in arr) {if ($0 ~ lookup) {split(arr[lookup], a); print "Here I want to add " a[1] " and " a[2]}}}' File1 File2
Explanation:
FNR == NR {arr[$1] = $2 " " $3; next} - While reading the first file, save its values in an array indexed by the first column. FNR == NR is true only for the first file, because the overall record number (NR) equals the per-file record number (FNR) only there.
print - Print every input line.
for (lookup in arr) {if ($0 ~ lookup) { - Loop through each of the array indices and see if the input line matches.
split(arr[lookup], a) - Split the value stored at the matched index into a temporary array.
print "Here I want to add " a[1] " and " a[2] - Print some text using the two values resulting from the split.
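One detail of the script above: $0 ~ lookup treats each array index as a regular expression, so the dot in Var2.1 matches any character. If a literal substring test is wanted, index() can be used instead. The sketch below is a variant under assumptions, not the answer's original code: the temporary directory, the array names second and third, and the shortened File2 contents are all illustrative.

```shell
# Build small sample files in a temporary directory (illustrative contents).
tmpdir=$(mktemp -d)
cat > "$tmpdir/File1" <<'EOF'
Var1.1 Var1.2 Var1.3
Var2.1 Var2.2 Var2.3
Var3.1 Var3.2 Var3.3
EOF
cat > "$tmpdir/File2" <<'EOF'
File2
***pattern with Var2.1***
some text
EOF

result=$(awk '
FNR == NR { second[$1] = $2; third[$1] = $3; next }   # first file: build lookup tables
{
    print
    for (key in second)
        if (index($0, key))                           # literal substring test, not a regex
            print "Here I want to add " second[key] " and " third[key]
}' "$tmpdir/File1" "$tmpdir/File2")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Storing the second and third columns in separate arrays also avoids the split() step.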
here is column 6 in a file:
ttttttttttt
tttttttttt
ttttttttt
tttttttattt
tttttttttt
ttttttttttt
How can I use awk to print out lines that include "a"?
If you only want to search the sixth column, use:
awk '$6 ~ /a/' file
If you want the whole line, any of these should work:
awk '/a/' file
grep a file
sed '/^[^a]*$/d' file
If you wish to print only those lines in which the 6th column contains a, then this would work:
awk '$6~/a/' file
If it is an exact match you're looking for (which yours is not):
$6 == "a"
http://www.pement.org/awk/awk1line.txt is an excellent resource.
awk can also tell you where the pattern is in the column:
awk '{++line_num}{ if ( match($6,"a")) { print "found a at position",RSTART, " line " ,line_num} }' file
though this example will only show the first "a" in column 6; a for loop would be needed to show all instances (I think)
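A loop that reports every "a" in column 6 could look like the sketch below. It uses index() on the remaining part of the field instead of match(), and keeps a running offset so positions refer to the original field. The sample rows and the variable names rest, offset and pos are assumptions made for illustration.

```shell
# Three illustrative rows; only the sixth column matters here.
input='c1 c2 c3 c4 c5 tttttttattt
c1 c2 c3 c4 c5 tttttttttt
c1 c2 c3 c4 c5 tatttattt'

result=$(printf '%s\n' "$input" | awk '
{
    rest = $6
    offset = 0                              # characters of $6 already consumed
    while ((pos = index(rest, "a")) > 0) {
        print "found a at position " (offset + pos) " line " NR
        offset += pos
        rest = substr(rest, pos + 1)        # continue after this occurrence
    }
}')
printf '%s\n' "$result"
```

NR is used for the line number, so no separate line counter is needed.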
You could try
gawk '{ if ( $1 ~ /a/ ) { print $1 } }' filename