How to replace certain cell values in a .csv file when specific lines are found, in Linux

Let's say file.csv has the following content:
"US","BANANA","123","100","0.5","ok"
"US","APPLE","456","201","0.1", "no"
"US","PIE","789","109","0.8","yes"
"US","APPLE","245","201","0.4","no"
I want to find all lines that contain APPLE and 201, and then replace the column 5 values with 0. So, my output would look like
"US","BANANA","123","100","0.5","ok"
"US","APPLE","456","201","0", "no"
"US","PIE","789","109","0.8","yes"
"US","APPLE","245","201","0","no"
I can do grep search
grep "APPLE" file.csv | grep 201
to find the lines, but I could not figure out how to modify the column 5 values of these lines in the original file.

You can use awk for this:
awk 'BEGIN { FS=OFS="," } $2=="\"APPLE\"" && $4=="\"201\"" { $5="\"0\"" } 1' file.csv
Set the input and output field delimiters to , and then, when the second field is equal to APPLE in quotes and the fourth field is equal to 201 in quotes, replace the 5th field with 0 in quotes. Print each line, changed or otherwise, with the shorthand 1.
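The question also asks for the change to land in the original file. A minimal sketch of one way to do that, assuming 201 always sits in column 4 and that a temporary file next to the original is acceptable (the scratch directory and the .tmp name are only for illustration):

```shell
# Work on a scratch copy so the sketch is safe to run anywhere.
workdir=$(mktemp -d)
cat > "$workdir/file.csv" <<'EOF'
"US","BANANA","123","100","0.5","ok"
"US","APPLE","456","201","0.1", "no"
"US","PIE","789","109","0.8","yes"
"US","APPLE","245","201","0.4","no"
EOF

# Rewrite column 5 on matching rows; replace the original only if awk succeeded.
awk 'BEGIN { FS = OFS = "," }
     $2 == "\"APPLE\"" && $4 == "\"201\"" { $5 = "\"0\"" }
     1' "$workdir/file.csv" > "$workdir/file.csv.tmp" &&
  mv "$workdir/file.csv.tmp" "$workdir/file.csv"

result=$(cat "$workdir/file.csv")
printf '%s\n' "$result"
```

Setting OFS matters here: assigning to $5 makes awk rebuild the line, and without OFS="," the fields would be joined with spaces.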

Related

Linux Command to get fields from CSV files

In CSV files on a Linux server, I have thousands of rows in the CSV format below:
0,20221208195546466,9,200,Above as:2|RAN34f2fb:HAER:0|RAND8365b2bca763:FON:0|RANDa7a5f964900b:ION:0|
I need output from all the files in the format below: the 2nd field (20221208195546466) and, from the 5th field, the value after "Above as:" and before the first | (2 in the example above).
output :
20221208195546466 , 2
Can anyone help me with a Linux command?
Edit:
My attempts
I tried this, but it gives only the 5th field's value. How do I add field 2 as well?
cat *.csv | cut -d, -f5 | cut -d'|' -f1 | cut -d':' -f2
EDIT : sorted result
Now I am using this command (based on Dave Pritlove's answer): awk -F'[,|:]' '{print $2", "$6}' file.csv. However, I have one more question: how can I sort the output on $6 (the value 2 in the example)? I want the result displayed in sorted order based on the 2nd output field.
For example:
20221208195546366, 20
20221208195546436, 16
20221208195546466, 5
2022120819536466, 2
GNU awk (like other POSIX awks) treats a field separator longer than one character as a regular expression, so a bracket expression lets you delimit each record at ,, |, and : at the same time. Thus, the following will fish out the required fields from file.csv:
awk -F'[,|:]' '{print $2", "$6}' file.csv
Tested on the single record example:
echo "0,20221208195546466,9,200,Above as:2|RAN34f2fb:HAER:0|RAND8365b2bca763:FON:0|RANDa7a5f964900b:ION:0|" | awk -F'[,|:]' '{print $2", "$6}'
output:
20221208195546466, 2
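For the follow-up about sorting on the second output field, one possible sketch is to pipe the awk output into sort. The three input records below are shortened, made-up variants of the sample line, and descending numeric order is an assumption based on the example in the question:

```shell
sorted=$(printf '%s\n' \
  '0,20221208195546466,9,200,Above as:5|x:HAER:0|' \
  '0,20221208195546366,9,200,Above as:20|x:HAER:0|' \
  '0,20221208195546436,9,200,Above as:16|x:HAER:0|' |
  awk -F'[,|:]' '{print $2", "$6}' |
  sort -t, -k2,2nr)   # -t, splits on commas; -k2,2nr sorts field 2 numerically, descending
printf '%s\n' "$sorted"
```

Drop the r from -k2,2nr for ascending order. The leading space in the second field is harmless because numeric sort skips leading blanks.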
Assumptions:
the starting string of the 5th comma-delimited field can vary from line to line (i.e., it is not known beforehand)
the item of interest in the 5th comma-delimited field occurs between the first : and the first |
Sample data:
$ cat test.csv
0,20221208195546466,9,200,Above as:2|RAN34f2fb:HAER:0|RAND8365b2bca763:FON:0|RANDa7a5f964900b:ION:0|
1,20230124123456789,10,1730,Total ts:7|stuff:HAER:0|morestuff:FON:0|yetmorestuff:ION:0|
One awk approach:
awk '
BEGIN { FS=OFS="," } # define input/output field delimiter as ","
{ split($5,a,"[:|]") # split 5th field on dual delimiters ":" and "|", store results in array a[]
print $2,a[2] # print desired items to stdout
}
' test.csv
This generates:
20221208195546466,2
20230124123456789,7
You can use awk for this:
awk -F',' '{gsub(/Above as:/,""); gsub(/\|.*/, ""); print($2, $5)}'
You will probably need to adapt the regexps a bit.
You might change : to , and | to , and then extract the 2nd and 6th fields using cut in the following way. Let the content of file.txt be
0,20221208195546466,9,200,Above as:2|RAN34f2fb:HAER:0|RAND8365b2bca763:FON:0|RANDa7a5f964900b:ION:0|
then
tr ':|' ',,' < file.txt | cut --delimiter=',' --output-delimiter=' , ' --fields=2,6
gives output
20221208195546466 , 2
Explanation: tr translates, i.e. replaces : with , and | with ,. Then I tell cut that the input delimiter is , and the output delimiter is , surrounded by spaces (as stipulated by your desired output), and that I want the 2nd and 6th columns (not the 5th, which is now Above as).
(tested using GNU coreutils 8.30)

How to add a header with a value after a particular column in Linux

Here I want to add a column with the header name Gender, and a value, after the column named Age.
cat Person.csv
First_Name|Last_Name||Age|Address
Ram|Singh|18|Punjab
Sanjeev|Kumar|32|Mumbai
I am using this:
cat Person.csv | sed '1s/$/|Gender/; 2,$s/$/|Male/'
output:
First_Name|Last_Name||Age|Address|Gender
Ram|Singh|18|Punjab|Male
Sanjeev|Kumar|32|Mumbai|Male
I want output like this:
First_Name|Last_Name|Age|Gender|Address
Ram|Singh|18|Male|Punjab
Sanjeev|Kumar|32|Male|Mumbai
I took the second pipe out (for consistency's sake) ... the sed should look like this:
$ sed -E '1s/^([^|]+\|[^|]+\|[^|]+\|)/\1Gender|/;2,$s/^([^|]+\|[^|]+\|[^|]+\|)/\1male|/' Person.csv
First_Name|Last_Name|Age|Gender|Address
Ram|Singh|18|male|Punjab
Sanjeev|Kumar|32|male|Mumbai
We match and remember the first three fields and replace them with themselves, followed by Gender and male respectively.
Using awk:
$ awk -F"|" 'BEGIN{ OFS="|"}{ last=$NF; $NF=""; print (NR==1) ? $0"Gender|"last : $0"Male|"last }' Person.csv
First_Name|Last_Name||Age|Gender|Address
Ram|Singh|18|Male|Punjab
Sanjeev|Kumar|32|Male|Mumbai
Use '|' as the input field separator and set the output field separator to '|' as well. Store the last column's value in a variable named last, then empty the last column with $NF="". Then print the appropriate output depending on whether this is the first row or a later row.
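Both answers above depend on the position of the fields. As an alternative sketch, the column can be located by its header name instead. This assumes a header without the stray double pipe; the variable names after, name and value are only for illustration:

```shell
with_gender=$(printf '%s\n' \
  'First_Name|Last_Name|Age|Address' \
  'Ram|Singh|18|Punjab' \
  'Sanjeev|Kumar|32|Mumbai' |
  awk -v after="Age" -v name="Gender" -v value="Male" '
    BEGIN { FS = OFS = "|" }
    # On the header row, remember which column holds the anchor name.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == after) pos = i }
    # Append the new cell to the anchor column: header text on row 1, the value elsewhere.
    pos { $pos = $pos OFS (NR == 1 ? name : value) }
    1')
printf '%s\n' "$with_gender"
```

If the anchor column is not found, pos stays unset and every line is printed unchanged.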

Splitting the first column of a file in multiple columns using AWK

File looks like this, but with millions of lines (TAB separated):
1_number_column_ranking_+ 100 200 Target "Hello"
I want to split the first column by the _ so it becomes:
1 number column ranking + 100 200 Target "Hello"
This is the code I have been trying:
awk -F"\t" '{n=split($1,a,"_");for (i=1;i<=n;i++) print $1"\t"a[i]}'
But it's not quite what I need.
Any help is appreciated (the other threads on this topic were not helpful for me).
No need to split, just replace would do:
awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1'
Eg:
$ cat file
1_number_column_ranking_+ 100 200 Target "Hello"
$ awk 'BEGIN{FS=OFS="\t"}{gsub("_","\t",$1)}1' file
1 number column ranking + 100 200 Target "Hello"
gsub replaces all occurrences. When no 3rd argument is given, it replaces in $0.
The final 1 is a shortcut for {print}: it is a pattern that is always true, and the default action is {print}.
Another awk approach, if the "_" appears only in the first column:
Split the input fields on the regex "[_\t]" and do a dummy operation like $1=$1 in the main section, so that $0 is reconstructed with OFS="\t".
$ cat steveman.txt
1_number_column_ranking_+ 100 200i Target "Hello"
$ awk -F"[_\t]" ' BEGIN { OFS="\t"} { $1=$1; print } ' steveman.txt
1 number column ranking + 100 200i Target "Hello"
$
Thanks @Ed, updated from -F"[_\t]+" to -F"[_\t]" so that empty fields are not merged away.
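A small sketch of why the + matters, using a made-up line with an empty field between two underscores:

```shell
# With +, the run "__" counts as one separator, so the empty field disappears.
plus=$(printf 'a__b\tc\n' | awk -F'[_\t]+' '{ print NF }')
# Without +, each "_" is its own separator, so the empty field is kept.
single=$(printf 'a__b\tc\n' | awk -F'[_\t]' '{ print NF }')
printf 'with +: %s fields, without +: %s fields\n' "$plus" "$single"
```

Keeping empty fields is what preserves the column alignment in the rebuilt line.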

How to use grep or awk to process a specific column ( with keywords from text file )

I've tried many combinations of grep and awk commands to process text from file.
This is a list of customers of this type:
John,Mills,81,Crescent,New York,NY,john@mills.com,19/02/1954
I am trying to separate these records into two categories, MEN and FEMALES.
I have a list of some 5000 Female Names , all in plain text , all in one file.
How can I "grep" the first column ( since I am only matching first names) but still printing the entire customer record ?
I found it easy to "cut" the first column and grep --file=female.names.txt, but this way it's not going to print the entire record any longer.
I am aware of the awk option but in that case I don't know how to read the female names from file.
awk -F ',' ' { if($1==" ???Filename??? ") print $0} '
Many thanks !
You can do this with Awk:
awk -F, 'NR==FNR{a[$0]; next} ($1 in a)' female.names.txt file.csv
This prints the lines of your CSV file whose first field is one of the names in female.names.txt.
awk -F, 'NR==FNR{a[$0]; next} !($1 in a)' female.names.txt file.csv
This prints the lines whose first field is not in female.names.txt.
This assumes the format of your female.names.txt file is something like:
Heather
Irene
Jane
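Since the goal is to separate the records into two categories, both commands can be folded into one pass. A minimal sketch, in which the output names female.csv and male.csv and the second customer record are made up for illustration:

```shell
workdir=$(mktemp -d)
printf '%s\n' 'Heather' 'Irene' 'Jane' > "$workdir/female.names.txt"
printf '%s\n' \
  'John,Mills,81,Crescent,New York,NY' \
  'Jane,Doe,12,High Street,Boston,MA' > "$workdir/file.csv"

( cd "$workdir" &&
  awk -F, '
    NR == FNR { names[$0]; next }   # first file: remember every name
    # second file: route each record by whether its first field is a known name
    { print > (($1 in names) ? "female.csv" : "male.csv") }
  ' female.names.txt file.csv )

female=$(cat "$workdir/female.csv")
male=$(cat "$workdir/male.csv")
printf 'female: %s\nmale: %s\n' "$female" "$male"
```

The parentheses around the ternary are needed because an unparenthesized expression after > is parsed differently by different awk implementations.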
Try this:
grep --file=<(sed 's/.*/^&,/' female.names.txt) datafile.csv
This changes all the names in the list of female names to the regular expression ^name, so it only matches at the beginning of the line and followed by a comma. Then it uses process substitution to use that as the file to match against the data file.
Another alternative is Perl, which can be useful if you're not super-familiar with awk.
#!/usr/bin/perl -anF,
use strict;
our %names;
BEGIN {
while (<ARGV>) {
chomp;
$names{$_} = 1;
}
}
print if $names{$F[0]};
To run (assume you named this file filter.pl):
perl filter.pl female.names.txt < records.txt
So, I've come up with the following:
Suppose, you have a file having the following lines in a file named test.txt:
abe 123 bdb 532
xyz 593 iau 591
Now you want to find the lines whose first field starts and ends with a vowel. A simple grep for vowels would return both lines, but the following gives only the first line, which is the desired output:
egrep "^([0-z]{1,} ){0}[aeiou][0-z]+[aeiou]" test.txt
Then you want to find the lines whose third field starts and ends with a vowel. Similarly, a simple grep would return both lines, but the following gives only the second line, which is the desired output:
egrep "^([0-z]{1,} ){2}[aeiou][0-z]+[aeiou]" test.txt
The {1,} in the first curly braces specifies that the preceding character class, which ranges from 0 to z in the ASCII table, must occur one or more times. After that comes the field separator, a space in this case. Change the value in the second curly braces, {0} or {2}, to the desired field number minus 1. Then use a regular expression to express your criteria.
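The same selection can be sketched in awk, which addresses fields directly instead of counting them in the regex. This assumes whitespace-separated fields and, unlike the egrep version, requires the field to end with the vowel:

```shell
# First field starts and ends with a vowel.
first=$(printf '%s\n' 'abe 123 bdb 532' 'xyz 593 iau 591' |
  awk '$1 ~ /^[aeiou].*[aeiou]$/')
# Third field starts and ends with a vowel.
third=$(printf '%s\n' 'abe 123 bdb 532' 'xyz 593 iau 591' |
  awk '$3 ~ /^[aeiou].*[aeiou]$/')
printf '%s\n%s\n' "$first" "$third"
```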

CSV grep but keep the header

I have a CSV file that look like this:
A,B,C
1,2,3
4,4,4
1,2,6
3,6,9
Is there an easy way to grep all the rows in which the B column is 2, and keep the header? For example, I want the output be like
A,B,C
1,2,3
1,2,6
I am working under linux
Using awk:
awk -F, 'NR==1 || $2==2' file
NR==1 -> if first line,
$2==2 -> if second column is equal to 2. Lines are printed if either of the above is true.
To choose the column using the header column name:
awk -F, -v col="B" 'NR==1{for(i=1;i<=NF;i++)if($i==col)break;print;next}$i==2' file
Replace B with the appropriate name of the column which you want to check against.
You can use addresses in sed:
sed -n '1p;/^[^,]*,2/p'
It means:
1p Print the first line.
/ Start a match.
^ Match the beginning of a line.
[^,] Match anything but a comma
* zero or more times.
, Match a comma.
2 Match a 2.
/p End of match, if it matches, print.
If the header can contain the value you are looking for, you should be more careful:
sed -n '1p;1!{/^[^,]*,2/p}'
1!{ ... } just means "Do the following for lines other than the first one".
For column number n>2, you can add a quantifier:
sed -n '1p;1!{/^\([^,]*,\)\{M\}2/p}'
where M=n-1. The quantifier just means repetition, so the non-comma-0-or-more-times-comma thing is repeated M times.
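As a concrete sketch of the quantifier, here is the pattern with M=2, selecting rows whose third column is exactly 6 from the sample data. The trailing $ is an addition that works because C is the last column; for a middle column, a comma after the value would play the same role:

```shell
filtered=$(printf '%s\n' 'A,B,C' '1,2,3' '4,4,4' '1,2,6' '3,6,9' |
  sed -n '1p;1!{/^\([^,]*,\)\{2\}6$/p;}')   # skip two comma-terminated fields, then match 6 at end of line
printf '%s\n' "$filtered"
```

The ; before the closing brace is not required by GNU sed, but it keeps the command portable to other sed implementations.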
For true CSV files where a value can contain a comma, switch to Perl and Text::CSV.
$ awk -F, 'NR==1 { for (i=1;i<=NF;i++) h[$i] = i; print; next } $h["B"] == 2' file
A,B,C
1,2,3
1,2,6
By the way, sed is an excellent tool for simple substitutions on a single line. For anything else, just use awk: the code will be clearer and MUCH easier to enhance in future if necessary.
