CSV grep but keep the header - Linux

I have a CSV file that looks like this:
A,B,C
1,2,3
4,4,4
1,2,6
3,6,9
Is there an easy way to grep all the rows in which the B column is 2, and keep the header? For example, I want the output to be like
A,B,C
1,2,3
1,2,6
I am working under Linux.

Using awk:
awk -F, 'NR==1 || $2==2' file
NR==1 -> if first line,
$2==2 -> if second column is equal to 2. Lines are printed if either of the above is true.
To choose the column using the header column name:
awk -F, -v col="B" 'NR==1{for(i=1;i<=NF;i++)if($i==col)break;print;next}$i==2' file
Replace B with the appropriate name of the column which you want to check against.
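For example, with the sample data above saved as file:
$ awk -F, -v col="B" 'NR==1{for(i=1;i<=NF;i++)if($i==col)break;print;next}$i==2' file
A,B,C
1,2,3
1,2,6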

You can use addresses in sed:
sed -n '1p;/^[^,]*,2/p'
It means:
1p Print the first line.
/ Start a match.
^ Match the beginning of a line.
[^,] Match anything but a comma
* zero or more times.
, Match a comma.
2 Match a 2.
/p End of match, if it matches, print.
If the header can contain the value you are looking for, you should be more careful:
sed -n '1p;1!{/^[^,]*,2/p}'
1!{ ... } just means "Do the following for lines other than the first one".
For column number n>2, you can add a quantifier:
sed -n '1p;1!{/^\([^,]*,\)\{M\}2/p}'
where M=n-1. The quantifier just means repetition, so the non-comma-zero-or-more-times-followed-by-comma group is repeated M times.
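For example, for column B in the sample file above n=2, so M=1:
$ sed -n '1p;1!{/^\([^,]*,\)\{1\}2/p}' file
A,B,C
1,2,3
1,2,6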
For true CSV files where a value can contain a comma, switch to Perl and Text::CSV.
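A minimal sketch of that approach, assuming the Text::CSV module is installed; the column name B, the match value 2, and the script name keep_header.pl are just for illustration:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, eol => "\n" });

# Always keep the header row, and find the wanted column by name.
my $header = $csv->getline(\*STDIN);
$csv->print(\*STDOUT, $header);
my ($idx) = grep { $header->[$_] eq 'B' } 0 .. $#$header;

# Print only the rows whose B column equals 2; Text::CSV handles
# quoted fields with embedded commas correctly.
while (my $row = $csv->getline(\*STDIN)) {
    $csv->print(\*STDOUT, $row) if $row->[$idx] eq '2';
}
Run it as: perl keep_header.pl < file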

$ awk -F, 'NR==1 { for (i=1;i<=NF;i++) h[$i] = i; print; next } $h["B"] == 2' file
A,B,C
1,2,3
1,2,6
By the way, sed is an excellent tool for simple substitutions on a single line. For anything else, just use awk - the code will be clearer and MUCH easier to enhance later if necessary.

Related

How to add a header with values after a particular column in Linux

I want to add a column with the header name Gender, filled with values, after the column named Age.
cat Person.csv
First_Name|Last_Name||Age|Address
Ram|Singh|18|Punjab
Sanjeev|Kumar|32|Mumbai
I am using this:
cat Person.csv | sed '1s/$/|Gender/; 2,$s/$/|Male/'
output:
First_Name|Last_Name||Age|Address|Gender
Ram|Singh|18|Punjab|Male
Sanjeev|Kumar|32|Mumbai|Male
I want output like this:
First_Name|Last_Name|Age|Gender|Address
Ram|Singh|18|Male|Punjab
Sanjeev|Kumar|32|Male|Mumbai
I took the doubled pipe out (for consistency's sake) ... the sed should look like this:
$ sed -E '1s/^([^|]+\|[^|]+\|[^|]+\|)/\1Gender|/;2,$s/^([^|]+\|[^|]+\|[^|]+\|)/\1male|/' Person.csv
First_Name|Last_Name|Age|Gender|Address
Ram|Singh|18|male|Punjab
Sanjeev|Kumar|32|male|Mumbai
We match and remember the first three fields and replace them with themselves, followed by Gender| on the header line and male| on the data lines.
Using awk:
$ awk -F"|" 'BEGIN{ OFS="|"}{ last=$NF; $NF=""; print (NR==1) ? $0"Gender|"last : $0"Male|"last }' Person.csv
First_Name|Last_Name||Age|Gender|Address
Ram|Singh|18|Male|Punjab
Sanjeev|Kumar|32|Male|Mumbai
Use '|' as the input field separator and set the output field separator to '|'. Store the last column's value in a variable named last, then remove the last column with $NF="". Then print the appropriate output depending on whether it is the first row or a succeeding row.
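As a sketch (not from the original answer), the insertion point can also be found by header name; this assumes the doubled pipe has been removed from the input and that a column named Address exists:
awk -F'|' -v OFS='|' '
NR==1 { for (i=1; i<=NF; i++) if ($i == "Address") pos = i }   # locate the target column
{ $pos = (NR==1 ? "Gender" : "Male") OFS $pos; print }         # prepend the new value to it
' Person.csv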

How to replace some cell values in a .csv file if specific lines are found in Linux

Let's say I have the following file.csv content:
"US","BANANA","123","100","0.5","ok"
"US","APPLE","456","201","0.1", "no"
"US","PIE","789","109","0.8","yes"
"US","APPLE","245","201","0.4","no"
I want to search all lines that have APPLE and 201, and then replace the column 5 values with 0. So my output would look like
"US","BANANA","123","100","0.5","ok"
"US","APPLE","456","201","0", "no"
"US","PIE","789","109","0.8","yes"
"US","APPLE","245","201","0","no"
I can do a grep search
grep "APPLE" file.csv | grep 201
to find the lines, but I could not figure out how to modify the column 5 values of those lines in the original file.
You can use awk for this:
awk 'BEGIN{FS=OFS=","} $2=="\"APPLE\"" { for (i=1;i<=NF;i++) if ($i=="\"201\"") $5="\"0\"" } 1' file.csv
Set the input and output field delimiters to , and then, when the second field is equal to APPLE in quotes, loop through each field and check whether it is equal to 201 in quotes. If it is, replace the 5th field with 0 in quotes (assigning to $5 makes awk rebuild the line, which is why OFS must also be set to a comma). Print each line, changed or otherwise, with the short-hand 1.
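For this particular file, the same change can be sketched with sed as well, relying on the fields never containing embedded commas and on APPLE and 201 sitting in columns 2 and 4:
sed -E 's/^("[^"]*","APPLE","[^"]*","201"),"[^"]*"/\1,"0"/' file.csv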

Replace every n'th occurrence in huge line in a loop

I have this line for example:
1,2,3,4,5,6,7,8,9,10
I want to insert a newline (\n) at every 2nd occurrence of "," (i.e., replace every 2nd comma with a newline).
This might work for you (GNU sed):
sed 's/,/\n/2;P;D' file
It replaces the second comma in the pattern space with a newline, prints up to that newline (P), deletes what was printed (D), and restarts the cycle on the remainder of the line.
If I understand what you're trying to do correctly, then
echo '1,2,3,4,5,6,7,8,9,10' | sed 's/\([^,]*,[^,]*\),/\1\n/g'
seems like the most straightforward way. \([^,]*,[^,]*\) will capture 1,2, 3,4, and so forth, and the commas between them are replaced with newlines through the usual s///g. This will print
1,2
3,4
5,6
7,8
9,10
I can't comment on wintermute's answer, but it doesn't need the leading [^,]* part, since every field after the first is already preceded by a comma.
sed 's/\(,[^,]*\),/\1\n/g'
Will work the same
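A quick check with GNU sed confirms it:
$ echo '1,2,3,4,5,6,7,8,9,10' | sed 's/\(,[^,]*\),/\1\n/g'
1,2
3,4
5,6
7,8
9,10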
I'll also add another alternative (albeit worse, and it leaves a trailing newline):
echo "1,2,3,4,5,6,7,8,9,10" | xargs -d"," -n2 | tr ' ' ','
I would use awk to do this:
$ awk -F, '{ for (i=1; i<=NF; ++i) printf "%s%s", $i, (i%2?FS:RS) }' file
1,2
3,4
5,6
7,8
9,10
It loops through each field, printing each one followed by either the field separator (defined as a comma) or the record separator (a newline) depending on the value of i%2.
It's slightly longer than the sed versions presented by others, although one nice thing about it is that you can alter the number of columns per line easily by changing the 2 to whatever value you like.
To avoid a trailing comma after the last field in the case where the number of fields isn't evenly divisible, you can change the ternary to i<NF&&i%2?FS:RS.
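For example, a variant with the group size in a variable (the name n is my addition, not part of the original answer):
$ awk -F, -v n=3 '{ for (i=1; i<=NF; ++i) printf "%s%s", $i, (i<NF && i%n ? FS : RS) }' file
1,2,3
4,5,6
7,8,9
10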

How to use grep or awk to process a specific column (with keywords from a text file)

I've tried many combinations of grep and awk commands to process text from a file.
This is a list of customers of this type:
John,Mills,81,Crescent,New York,NY,john#mills.com,19/02/1954
I am trying to separate these records into two categories, MEN and WOMEN.
I have a list of some 5000 female names, all in plain text, all in one file.
How can I "grep" the first column (since I am only matching first names) but still print the entire customer record?
I found it easy to "cut" the first column and grep --file=female.names.txt, but then it no longer prints the entire record.
I am aware of the awk option, but in that case I don't know how to read the female names from the file.
awk -F ',' ' { if($1==" ???Filename??? ") print $0} '
Many thanks!
You can do this with Awk:
awk -F, 'NR==FNR{a[$0]; next} ($1 in a)' female.names.txt file.csv
This would print the lines of your csv file whose first field matches any of the names in female.names.txt.
awk -F, 'NR==FNR{a[$0]; next} !($1 in a)' female.names.txt file.csv
This would print the lines whose first field is not found in female.names.txt.
This assumes the format of your female.names.txt file is something like:
Heather
Irene
Jane
Try this:
grep --file=<(sed 's/.*/^&,/' female.names.txt) datafile.csv
This changes all the names in the list of female names to the regular expression ^name, so it only matches at the beginning of the line and followed by a comma. Then it uses process substitution to use that as the file to match against the data file.
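With the sample names above, the process substitution hands grep this list of anchored patterns:
$ sed 's/.*/^&,/' female.names.txt
^Heather,
^Irene,
^Jane,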
Another alternative is Perl, which can be useful if you're not super-familiar with awk.
#!/usr/bin/perl -anF,
use strict;
our %names;

# Read the names file (given as an argument) into a hash before the
# implicit -n loop starts processing the records on stdin.
BEGIN {
    while (<ARGV>) {
        chomp;
        $names{$_} = 1;
    }
}

# -a/-F, autosplit each record on commas into @F; print the record
# if its first field is a known name.
print if $names{$F[0]};
To run (assume you named this file filter.pl):
perl filter.pl female.names.txt < records.txt
So, I've come up with the following:
Suppose you have a file named test.txt containing the following lines:
abe 123 bdb 532
xyz 593 iau 591
Suppose you want to find the lines whose first field starts and ends with a vowel. A simple grep for vowels would match both lines, but the following prints only the first line, which is the desired output:
egrep "^([0-z]{1,} ){0}[aeiou][0-z]+[aeiou]" test.txt
Now suppose you want to find the lines whose third field starts and ends with a vowel. Similarly, a plain grep would match both lines, but the following prints only the second line, which is the desired output:
egrep "^([0-z]{1,} ){2}[aeiou][0-z]+[aeiou]" test.txt
The value in the first curly braces, {1,}, specifies that the preceding character class, which covers the range 0 to z in the ASCII table, can occur one or more times. After that comes the field separator, a space in this case. Change the value within the second curly braces ({0} or {2} above) to the desired field number minus 1. Then use a regular expression to state your criteria.

How to use a Linux command (sed?) to delete specific lines in a file?

I have a file that contains a matrix. For example, I have:
1 a 2 b
2 b 5 b
3 d 4 b
4 b 7 b
I know it's easy to use the sed command to delete lines containing specific strings. But what if I only want to delete the lines where the second field's value is b (i.e., the second and fourth lines)?
You can use a regex in sed:
sed -i 's/^[0-9]\s\+b.*//' xxx_file
or
sed -i '/^[0-9]\s\+b/d' xxx_file
The "-i" argument modifies the file's content in place; you can remove "-i" and redirect the result to another file if you want. Note that + must be escaped as \+ in sed's basic regex syntax, and that the first form only empties the matching lines, while the second actually deletes them.
awk works just fine here; use code like this:
awk '{if ($2 != "b") print $0;}' file
If you want to learn more about awk, just man it!
awk:
awk '{ if ($2 != "b") print }' yourfile.txt
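Since printing the line is awk's default action, the condition alone also works:
awk '$2 != "b"' yourfile.txt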
