The difference between awk $1 and $NF?

case 1:
echo 'ABC-dev-test.zip' | awk -F'^ABC' '{print $1}'
output : (empty)
case 2:
echo 'ABC-dev-test.zip' | awk -F'^ABC' '{print $NF}'
output : -dev-test.zip
I wonder why it comes out like this. As I understand it, $1 is the first field and $NF uses NF, the number of fields, so I thought they would both end up pointing at the same field and the two commands would print the same value.
Why is it different?

You are getting an empty first field because there is nothing before ABC in your input. It is always going to be empty, since we are specifically making ABC at the start of the line the field separator, not any ABC in the line. When you make something the field separator, whatever comes before it is treated as the first field ($1); here there is nothing before it, hence your first command prints nothing.
Let's run the following command to see how many fields we have with the shown sample and what their respective values are:
echo 'ABC-dev-test.zip' | awk -F'^ABC' '{for(i=1;i<=NF;i++){print "Field Number:"i " Field value is:" $i}}'
Field Number:1 Field value is:
Field Number:2 Field value is:-dev-test.zip
You can clearly see that the first field is empty once ^ABC is the field separator, while $NF (which means the last field of the current line, not the number of fields) works because -dev-test.zip comes after ABC in your sample.
Additional note: you are currently matching only an ABC at the start of the line. If you want any ABC to act as the field separator, then for an input like XYZ-ABC-dev-test.zipABC you would get XYZ- as the first field's value.
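For instance (a quick check with an unanchored ABC separator):
echo 'XYZ-ABC-dev-test.zipABC' | awk -F'ABC' '{print $1}'
XYZ-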
Let's test this with the string ABC-dev-test.zipABC-resvalues, which contains 2 occurrences of ABC.
When we run it with ^ABC as the field separator, notice two things: the first field is empty, and the second ABC is not caught as a field separator.
echo 'ABC-dev-test.zipABC-resvalues' | awk -F'^ABC' '{for(i=1;i<=NF;i++){print "Field Number:"i " Field value is:" $i}}'
Field Number:1 Field value is:
Field Number:2 Field value is:-dev-test.zipABC-resvalues
When we change the field separator to plain ABC, see the difference: every occurrence of ABC in the whole value is caught and treated as a field separator.
echo 'ABC-dev-test.zipABC-resvalues' | awk -F'ABC' '{for(i=1;i<=NF;i++){print "Field Number:"i " Field value is:" $i}}'
Field Number:1 Field value is:
Field Number:2 Field value is:-dev-test.zip
Field Number:3 Field value is:-resvalues
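Side note: if the goal is only to strip a leading ABC rather than split on it, sub() may be more direct (a sketch, not part of the original answer):
echo 'ABC-dev-test.zip' | awk '{sub(/^ABC/,""); print}'
-dev-test.zip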

Related

Linux Command to get fields from CSV files

In CSV files on a Linux server, I have thousands of rows in the CSV format below:
0,20221208195546466,9,200,Above as:2|RAN34f2fb:HAER:0|RAND8365b2bca763:FON:0|RANDa7a5f964900b:ION:0|
I need to get output from all the files in the format below (the 2nd field, i.e. 20221208195546466, and from the 5th field only the value after Above as: and before the first |, i.e. 2 in the example above).
output :
20221208195546466 , 2
Can anyone help me with a Linux command?
Edit :
my attempts
I tried the following, but it only gives the 5th field's value. How do I add field 2 as well?
cat *.csv | cut -d, -f5 | cut -d'|' -f1 | cut -d':' -f2
EDIT: sorted result
Now I am using this command (based on Dave Pritlove's answer): awk -F'[,|:]' '{print $2", "$6}' file.csv. However, I have one more query: if I have to sort the output based on $6 (the value 2 in your example), how can I do it? I want the result displayed in sorted order based on the 2nd output field, for example:
20221208195546366, 20
20221208195546436, 16
20221208195546466, 5
2022120819536466, 2
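One way to get that ordering (a sketch, not from the original answers) is to pipe the awk output through sort, using the 2nd output field as a reverse numeric key:
awk -F'[,|:]' '{print $2", "$6}' file.csv | sort -t',' -k2,2nr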
Awk allows the field separator to be a regular expression, so you can delimit each record at ,, |, and : at the same time. Thus, the following will fish out the required fields from file.csv:
awk -F'[,|:]' '{print $2", "$6}' file.csv
Tested on the single record example:
echo "0,20221208195546466,9,200,Above as:2|RAN34f2fb:HAER:0|RAND8365b2bca763:FON:0|RANDa7a5f964900b:ION:0|" | awk -F'[,|:]' '{print $2", "$6}'
output:
20221208195546466, 2
Assumptions:
the starting string of the 5th comma-delimited field can vary from line to line (i.e., not known beforehand)
the item of interest in the 5th comma-delimited field occurs between the first : and the first |
Sample data:
$ cat test.csv
0,20221208195546466,9,200,Above as:2|RAN34f2fb:HAER:0|RAND8365b2bca763:FON:0|RANDa7a5f964900b:ION:0|
1,20230124123456789,10,1730,Total ts:7|stuff:HAER:0|morestuff:FON:0|yetmorestuff:ION:0|
One awk approach:
awk '
BEGIN { FS=OFS="," } # define input/output field delimiter as ","
{ split($5,a,"[:|]") # split 5th field on dual delimiters ":" and "|", store results in array a[]
print $2,a[2] # print desired items to stdout
}
' test.csv
This generates:
20221208195546466,2
20230124123456789,7
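Since the question mentions thousands of files, the same script can be pointed at a shell glob (a sketch, assuming every file shares this format):
awk 'BEGIN{FS=OFS=","} {split($5,a,"[:|]"); print $2,a[2]}' *.csv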
You can use awk for this:
awk -F',' '{gsub(/Above as:/,""); gsub(/\|.*/, ""); print($2, $5)}'
You will probably need to adapt the regexp a bit.
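A quick sanity check with the sample record from the question (note the two values come out space-separated rather than " , "-separated):
echo "0,20221208195546466,9,200,Above as:2|RAN34f2fb:HAER:0|RAND8365b2bca763:FON:0|RANDa7a5f964900b:ION:0|" | awk -F',' '{gsub(/Above as:/,""); gsub(/\|.*/, ""); print($2, $5)}'
20221208195546466 2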
You might change : to , and | to ,, then extract the 2nd and 6th fields using cut in the following way. Let file.txt content be:
0,20221208195546466,9,200,Above as:2|RAN34f2fb:HAER:0|RAND8365b2bca763:FON:0|RANDa7a5f964900b:ION:0|
then
tr ':|' ',,' < file.txt | cut --delimiter=',' --output-delimiter=' , ' --fields=2,6
gives output
20221208195546466 , 2
Explanation: tr translates characters, i.e. replaces : with , and | with ,. Then I inform cut that the input delimiter is ,, that the output delimiter is , encased in spaces (as stipulated by your desired output), and that I want the 2nd and 6th columns (not the 5th, which is now Above as).
(tested using GNU coreutils 8.30)

How to add a header with a value after a particular column in Linux

Here I want to add a column with the header name Gender after the column named Age, with a value in every row.
cat Person.csv
First_Name|Last_Name||Age|Address
Ram|Singh|18|Punjab
Sanjeev|Kumar|32|Mumbai
I am using this:
cat Person.csv | sed '1s/$/|Gender/; 2,$s/$/|Male/'
output:
First_Name|Last_Name||Age|Address|Gender
Ram|Singh|18|Punjab|Male
Sanjeev|Kumar|32|Mumbai|Male
I want output like this:
First_Name|Last_Name|Age|Gender|Address
Ram|Singh|18|Male|Punjab
Sanjeev|Kumar|32|Male|Mumbai
I took the duplicated second pipe out of the header (for consistency's sake) ... the sed should look like this:
$ sed -E '1s/^([^|]+\|[^|]+\|[^|]+\|)/\1Gender|/;2,$s/^([^|]+\|[^|]+\|[^|]+\|)/\1male|/' Person.csv
First_Name|Last_Name|Age|Gender|Address
Ram|Singh|18|male|Punjab
Sanjeev|Kumar|32|male|Mumbai
We match and remember the first three fields and replace them with themselves, followed by Gender and male respectively.
Using awk:
$ awk -F"|" 'BEGIN{ OFS="|"}{ last=$NF; $NF=""; print (NR==1) ? $0"Gender|"last : $0"Male|"last }' Person.csv
First_Name|Last_Name||Age|Gender|Address
Ram|Singh|18|Male|Punjab
Sanjeev|Kumar|32|Male|Mumbai
Use '|' as the input field separator and set the output field separator to '|'. Store the last column's value in a variable named last, then empty the last column with $NF="". Then print the appropriate output depending on whether it is the first row or a succeeding row.
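As in the sed answer, the doubled pipe in the header can be squeezed out first if it is unintended (a sketch, assuming it is a typo in the input):
sed '1s/||/|/' Person.csv | awk -F"|" 'BEGIN{ OFS="|"}{ last=$NF; $NF=""; print (NR==1) ? $0"Gender|"last : $0"Male|"last }'
First_Name|Last_Name|Age|Gender|Address
Ram|Singh|18|Male|Punjab
Sanjeev|Kumar|32|Male|Mumbai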

How to replace some cell values in a .csv file if specific lines are found in Linux

Let's say I have the following file.csv content:
"US","BANANA","123","100","0.5","ok"
"US","APPLE","456","201","0.1", "no"
"US","PIE","789","109","0.8","yes"
"US","APPLE","245","201","0.4","no"
I want to search for all lines that have APPLE and 201, and then replace the column 5 values with 0. So my output would look like:
"US","BANANA","123","100","0.5","ok"
"US","APPLE","456","201","0", "no"
"US","PIE","789","109","0.8","yes"
"US","APPLE","245","201","0","no"
I can do a grep search
grep "APPLE" file.csv | grep 201
to find the lines, but I could not figure out how to modify the column 5 values of those lines in the original file.
You can use awk for this:
awk 'BEGIN{FS=OFS=","} $2=="\"APPLE\"" { for (i=1;i<=NF;i++) { if ($i=="\"201\"") { $5="\"0\"" } } } 1' file.csv
Set both the input and output field delimiters to , and then, when the second field is equal to APPLE in quotes, loop through each field and check whether it is equal to 201 in quotes. If it is, set the 5th field to 0 in quotes. Print each line, changed or otherwise, with the shorthand 1.
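If the 201 value is always in the 4th column, the field loop is unnecessary (a sketch, assuming that fixed layout):
awk 'BEGIN{FS=OFS=","} $2=="\"APPLE\"" && $4=="\"201\"" { $5="\"0\"" } 1' file.csv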

How to remove a value between the 5th and 6th separator using SED?

Remove the value between the 5th and 6th separators:
000000000000;00000000000000;2;NONE;true;526;246;101;100;2;1;;;;;;8;101/100.0000.99.99;526/125.000.122.000
We need to get:
000000000000;00000000000000;2;NONE;true;;246;101;100;2;1;;;;;;8;101/100.0000.99.99;526/125.000.122.000
Using awk you can do this:
s='000000000000;00000000000000;2;NONE;true;526;246;101;100;2;1;;;;;;8;101/100.0000.99.99;526/125.000.122.000'
awk 'BEGIN{FS=OFS=";"} {$6=""} 1' <<< "$s"
000000000000;00000000000000;2;NONE;true;;246;101;100;2;1;;;;;;8;101/100.0000.99.99;526/125.000.122.000
FS=OFS=";" sets input and output field separators as ;
$6="" makes 6th field empty
1 prints the whole record
Let's define your string as s:
$ s='000000000000;00000000000000;2;NONE;true;526;246;101;100;2;1;;;;;;8;101/100.0000.99.99;526/125.000.122.000'
To remove the sixth field:
$ echo "$s" | sed -E 's/(([^;]*;){5})[^;]*/\1/'
000000000000;00000000000000;2;NONE;true;;246;101;100;2;1;;;;;;8;101/100.0000.99.99;526/125.000.122.000
How it works
We use a single sed substitution command:
s/(([^;]*;){5})[^;]*/\1/
Here, (([^;]*;){5}) matches the first five fields and saves them in group 1.
[^;]* matches the field that follows. In other words, it matches the sixth field.
The replacement text is just \1 which means group 1 which is the first five fields. Thus, the sixth field is removed and not replaced.
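The number of fields to keep can also be parameterized from the shell (a sketch, assuming GNU sed for -E and bash for the <<< herestring already used above):
n=5
sed -E "s/(([^;]*;){$n})[^;]*/\1/" <<< "$s"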

Is there a way to remove only the followed duplicates?

I have a CSV input with these lines:
1,zzzz,xxxx,
1,xxxx,xyxy,
2,xxxx,xxxx,
3,yyyy,xxxx,
3,xxxx,yyyy,
3,xxxx,zzzz,
1,ffff,xxxx,
1,aaaa,xxxx,
And I need to discard lines where the first field matches that of the preceding line:
1,zzzz,xxxx,
2,xxxx,xxxx,
3,yyyy,xxxx,
1,ffff,xxxx,
I tried sort | uniq alone, but it didn't work because all the lines are different except for the first field (the number).
Use awk instead of uniq:
awk -F, '$1 != last { last=$1; print }'
-F, sets the field separator to comma. $1 is the contents of the first field, so this prints the line whenever the first field changes.
Got the wanted output with uniq --check-chars=N: uniq then compares only the first N characters of each line, and since uniq only collapses adjacent duplicate lines, the same leading characters are still allowed to appear again later in the list even though the input isn't sorted.
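For the sample above, where the key is a single character, that looks like this (a sketch; input.csv is a hypothetical file holding the input, and this only works while the key has a fixed width):
uniq --check-chars=1 input.csv
1,zzzz,xxxx,
2,xxxx,xxxx,
3,yyyy,xxxx,
1,ffff,xxxx,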
