Linux bash script: how to search on a column but return full row? - linux

I have a tab-delimited file with data like this:
col1 col2 col3
I wrote a bash script that allows the file to be searched using this code:
echo -en "Search term: "
read search
data=`cat data.data | egrep -i "$search"`
This works great for searching the entire file, but I'm now wanting to search only on a specific column (which the user can choose).
I am aware of the cut command and can search on a column using this:
cat data.data | cut -f$col | egrep -i "$search"
But then only that column is outputted, so if I use this method then I somehow need to get the rest of the row back.
How can I search on a column in the file, but return the full rows for the results?

You can pass two variables to awk: the column number and the search term.
awk -vcol="$col" -vsearch="$search" '$col ~ search' data.data
If the value of $col is 2, then $2 in awk will correspond to the second column. The ~ operator is used to do a regular expression pattern match. The line will be printed if the column matches the regular expression.
Testing it out:
$ cat data.data
col1 col2 col3
$ col=2
$ search=l2
$ awk -vcol="$col" -vsearch="$search" '$col ~ search' data.data
col1 col2 col3
$ search=l3
$ awk -vcol="$col" -vsearch="$search" '$col ~ search' data.data
# no output
If you want to do case-insensitive pattern matching, you have two options: convert everything to upper or lower case (tolower($col) ~ tolower(search)), or if you are using GNU awk, set the IGNORECASE variable:
$ search=L2
$ awk -vIGNORECASE=1 -vcol="$col" -vsearch="$search" '$col ~ search' data.data
col1 col2 col3

awk is easier for this:
data=$(awk -v col=$col -v term="$term" 'toupper($col)==toupper(term)' file)
col - column number
term - search term
You could also pass field separator with -F if needed.

Related

How to add a Header with value after a perticular column in linux

Here I want to add a column with header name Gender after column name Age with value.
cat Person.csv
First_Name|Last_Name||Age|Address
Ram|Singh|18|Punjab
Sanjeev|Kumar|32|Mumbai
I am using this:
cat Person.csv | sed '1s/$/|Gender/; 2,$s/$/|Male/'
output:
First_Name|Last_Name||Age|Address|Gender
Ram|Singh|18|Punjab|Male
Sanjeev|Kumar|32|Mumbai|Male
I want output like this:
First_Name|Last_Name|Age|Gender|Address
Ram|Singh|18|Male|Punjab
Sanjeev|Kumar|32|Male|Mumbai
I took the second pipe out (for consistency's sake) ... the sed should look like this:
$ sed -E '1s/^([^|]+\|[^|]+\|[^|]+\|)/\1Gender|/;2,$s/^([^|]+\|[^|]+\|[^|]+\|)/\1male|/' Person.csv
First_Name|Last_Name|Age|Gender|Address
Ram|Singh|18|male|Punjab
Sanjeev|Kumar|32|male|Mumbai
We match and remember the first three fields and replace them with themselves, followed by Gender and male respectively.
Using awk:
$ awk -F"|" 'BEGIN{ OFS="|"}{ last=$NF; $NF=""; print (NR==1) ? $0"Gender|"last : $0"Male|"last }' Person.csv
First_Name|Last_Name||Age|Gender|Address
Ram|Singh|18|Male|Punjab
Sanjeev|Kumar|32|Male|Mumbai
Use '|' as the input field separator and set the output field separator as '|'. Store the last column value in variable named last and then remove the last column $NF="". Then print the appropriate output based on whether is first row or succeeding rows.

Filtering on a condition using the column names and not numbers

I am trying to filter a text file with columns based on two conditions. Due to the size of the file, I cannot use the column numbers (as there are thousands and are unnumbered) but need to use the column names. I have searched and tried to come up with multiple ways to do this but nothing is returned to the command line.
Here are a few things I have tried:
awk '($colname1==2 && $colname2==1) { count++ } END { print count }' file.txt
to filter out the columns based on their conditions
and
head -1 file.txt | tr '\t' | cat -n | grep "COLNAME
to try and return the possible column number related to the column.
An example file would be:
ID ad bd
1 a fire
2 b air
3 c water
4 c water
5 d water
6 c earth
Output would be:
2 (count of ad=c and bd=water)
with your input file and the implied conditions this should work
$ awk -v c1='ad' -v c2='bd' 'NR==1{n=split($0,h); for(i=1;i<=n;i++) col[h[i]]=i}
$col[c1]=="c" && $col[c2]=="water"{count++} END{print count+0}' file
2
or you can replace c1 and c2 with the values in the script as well.
to find the column indices you can run
$ awk -v cols='ad bd' 'BEGIN{n=split(cols,c); for(i=1;i<=n;i++) colmap[c[i]]}
NR==1{for(i=1;i<=NF;i++) if($i in colmap) print $i,i; exit}' file
ad 2
bd 3
or perhaps with this chain
$ sed 1q file | tr -s ' ' \\n | nl | grep -E 'ad|bd'
2 ad
3 bd
although may have false positives due to regex match...
You can rewrite the awk to be more succinct
$ awk -v cols='ad bd' '{while(++i<=NF) if(FS cols FS ~ FS $i FS) print $i,i;
exit}' file
ad 2
bd 3
As I mentioned in an earlier comment, the answer at https://unix.stackexchange.com/a/359699/133219 shows how to do this:
awk -F'\t' '
NR==1 {
for (i=1; i<=NF; i++) {
f[$i] = i
}
}
($(f["ad"]) == "c") && ($(f["bd"]) == "water") { cnt++ }
END { print cnt+0 }
' file
2
I'm assuming your input is tab-separated due to the tr '\t' in the command in your question that looks like you're trying to convert tabs to newlines to convert column names to numbers. If I'm wrong and they're just separated by any chains of white space then remove -F'\t' from the above.
Use miller toolkit to manipulate tab-delimited files using column names. Below is a one-liner that filters a tab-delimited file (delimiter is specified using --tsv) and writes the results to STDOUT together with the header. The header is removed using tail and the lines are counted with wc.
mlr --tsv filter '$ad == "c" && $bd == "water"' file.txt | tail -n +2 | wc -l
Prints:
2
SEE ALSO:
miller manual
Note that miller can be easily installed, for example, using conda, like so:
conda create --name miller miller
For years it bugged me there is no succinct way in Unix to do this sort of thing, although miller is a pretty good tool for this. Recently I wrote pick to choose columns by name, and additionally modify, combine and add them by name, as well as filtering rows by clauses using column names. The solution to the above with pick is
pick -h #ad=c #bd=water < data.txt | wc -l
By default pick prints the header of the selected columns, -h is to omit it. To print columns you simply name them on the command line, e.g.
pick ad water < data.txt | wc -l
Pick has many modes, all of them focused on manipulating columns and selecting/filtering rows with a minimal amount of syntax.

linux: extract pattern from file

I have a big tab delimited .txt file of 4 columns
col1 col2 col3 col4
name1 1 2 ens|name1,ccds|name2,ref|name3,ref|name4
name2 3 10 ref|name5,ref|name6
... ... ... ...
Now I want to extract from this file everything that starts with 'ref|'. This pattern is only present in col4
So for this example I would like to have as output
ref|name3
ref|name4
ref|name5
ref|name6
I thought of using 'sed' for this, but I don't know where to start.
I think awk is better suited for this task:
$ awk '{for (i=1;i<=NF;i++){if ($i ~ /ref\|/){print $i}}}' FS='( )|(,)' infile
ref|name3
ref|name4
ref|name5
ref|name6
FS='( )|(,)' sets a multile FS to itinerate columns by , and blank spaces, then prints the column when it finds the ref pattern.
Now I want to extract from this file everything that starts with
'ref|'. This pattern is only present in col4
If you are sure that the pattern only present in col4, you could use grep:
grep -o 'ref|[^,]*' file
output:
ref|name3
ref|name4
ref|name5
ref|name6
One solution I had was to first use awk to only get the 4th column, then use sed to convert commas into newlines, and then use grep (or awk again) to get the ones that start with ref:
awk '{print $4}' < data.txt | sed -e 's/,/\n/g' | grep "^ref"
This might work for you (GNU sed):
sed 's/\(ref|[^,]*\),/\n\1\n/;/^ref/P;D' file
Surround the required strings by newlines and only print those lines that begin with the start of the required string.

Conditional Awk hashmap match lookup

I have 2 tabular files. One file contains a mapping of 50 key values only called lookup_file.txt.
The other file has the actual tabular data with 30 columns and millions of rows. data.txt
I would like to replace the id column of the second file with the values from the lookup_file.txt..
How can I do this? I would prefer using awk in bash script..
Also, Is there a hashmap data-structure i can use in bash for storing the 50 key/values rather than another file?
Assuming your files have comma-separated fields and the "id column" is field 3:
awk '
BEGIN{ FS=OFS="," }
NR==FNR { map[$1] = $2; next }
{ $3 = map[$3]; print }
' lookup_file.txt data.txt
If any of those assumptions are wrong, clue us in if the fix isn't obvious...
EDIT: and if you want to avoid the (IMHO negligible) NR==FNR test performance impact, this would be one of those every rare cases when use of getline is appropriate:
awk '
BEGIN{
FS=OFS=","
while ( (getline line < "lookup_file.txt") > 0 ) {
split(line,f)
map[f[1]] = f[2]
}
}
{ $3 = map[$3]; print }
' data.txt
You could use a mix of "sort" and "join" via bash instead of having to write it in awk/sed and it is likely to be even faster:
key.cvs (id, name)
1,homer
2,marge
3,bart
4,lisa
5,maggie
data.cvs (name,animal,owner,age)
snowball,dog,3,1
frosty,yeti,1,245
cujo,dog,5,4
Now, you need to sort both files first on the user id columns:
cat key.cvs | sort -t, -k1,1 > sorted_keys.cvs
cat data.cvs | sort -t, -k3,3 > sorted_data.cvs
Now join the 2 files:
join -1 1 -2 3 -o "2.1 2.2 1.2 2.4" -t , sorted_keys.cvs sorted_data.cvs > replaced_data.cvs
This should produce:
snowball,dog,bart,1
frosty,yeti,homer,245
cujo,dog,maggie,4
This:
-o "2.1 2.2 1.2 2.4"
Is saying what columns from the 2 files you want in your final output.
It is pretty fast for finding and replacing multiple gigs of data compared to other scripting languages. I haven't done a direct comparison to SED/AWK, but it is much easier to write a bash script wrapping this than writing in SED/AWK (for me at least).
Also, you can speed up the sort by using an upgraded version of gnu coreutils so that you can do the sort in parallel
cat data.cvs | sort --parallel=4 -t, -k3,3 > sorted_data.cvs
4 being how many threads you want to run it in. I was recommended 2 threads per machine core will usually max out the machine, but if it is dedicated just for this, that is fine.
There are several ways to do this. But if you want an easy one liner, without much in the way of validation I would go with an awk/sed solution.
Assume the following:
the files are tab delimited
you are using bash shell
the id in the data file is in the first column
your files look like this:
lookup
1 one
2 two
3 three
4 four
5 five
data
1 col2 col3 col4 col5
2 col2 col3 col4 col5
3 col2 col3 col4 col5
4 col2 col3 col4 col5
5 col2 col3 col4 col5
I would use awk and sed to accomplish this task like this:
awk '{print "sed -i s/^"$1"/"$2"/ data"}' lookup | bash
what this is doing is going through each line of lookup and writing the following to stdout
sed -i s/^1/one/ data
sed -i s/^2/two/ data
and so on.
it next pipes each line to the shell (| bash), which will execute the sed expression. -i for inplace, you may want -i.bak to create a backup file. note you can change the extension to whatever you would like.
the sed is looking for the id at the start of the line, as indicated by the ^. You don't want to be replacing an 'id' in a column that might not contain an id.
your output would look like the following:
one col2 col3 col4 col5
two col2 col3 col4 col5
three col2 col3 col4 col5
four col2 col3 col4 col5
five col2 col3 col4 col5
of course, your ids are probably not simply 1 to one, 2 to two, etc, but this might get you started in the right direction. And I use the term right very loosely.
The way I'd do this is to use awk to write an awk program to process the larger file:
awk -f <(awk '
BEGIN{print " BEGIN{"}
{printf " a[\"%s\"]=\"%s\";",$1,$2}
END {print " }";
print " {$1=a[$1];print $0}"}
' lookup_file.txt
) data.txt
That assumes that the id column is column 1; if not, you need to change both instances of $1 in $1=a[$1]

CSV grep but keep the header

I have a CSV file that look like this:
A,B,C
1,2,3
4,4,4
1,2,6
3,6,9
Is there an easy way to grep all the rows in which the B column is 2, and keep the header? For example, I want the output be like
A,B,C
1,2,3
1,2,6
I am working under linux
Using awk:
awk -F, 'NR==1 || $2==2' file
NR==1 -> if first line,
$2==2 -> if second column is equal to 2. Lines are printed if either of the above is true.
To choose the column using the header column name:
awk -F, -v col="B" 'NR==1{for(i=1;i<=NF;i++)if($i==col)break;print;next}$i==2' file
Replace B with the appropriate name of the column which you want to check against.
You can use addresses in sed:
sed -n '1p;/^[^,]*,2/p'
It means:
1p Print the first line.
/ Start a match.
^ Match the beginnning of a line.
[^,] Match anything but a comma
* zero or more times.
, Match a comma.
2 Match a 2.
/p End of match, if it matches, print.
If the header can contain the value you are looking for, you should be more careful:
sed -n '1p;1!{/^[^,]*,2/p}'
1!{ ... } just means "Do the following for lines other then the first one".
For column number n>2, you can add a quantifier:
sed -n '1p;1!{/^\([^,]*,\)\{M\}2/p}'
where M=n-1. The quantifier just means repetition, so the non-comma-0-or-more-times-comma thing is repeated M times.
For true CSV files where a value can contain a comma, switch to Perl and Text::CSV.
$ awk -F, 'NR==1 { for (i=1;i<=NF;i++) h[$i] = i; print; next } $h["B"] == 2' file
A,B,C
1,2,3
1,2,6
By the way, sed is an excellent tool for simple substitutions on a single line, for anything else, just use awk - the code will be clearer and MUCH easier to enhance in future if necessary.

Resources