How to grep a file and output the matching part of a line plus a few surrounding words? - linux

I am searching a file for a pattern and would like to limit the output so that it shows not the whole line but the match surrounded by a few words of context. The lines are too long to view comfortably in full. I'm looking for a solution with grep, awk, and/or sed. grep has the -o option, which might work given the right regular expression.
As an extra feature, it would be nice if the solution would optionally support grep's line number feature, so that line numbers could be printed along with the output when desired.
UPDATE:
Here is a test file:
1 2 3 4 5 abc 1 2 3 4
abc
1 2 abc
abc 1
1 abc 1
1 2 3 abc 1 2 3
1 2 3 4 abc 1
1 2 3 4 5 6
1 2 3 4 5
1 2 3
1
SOLUTION:
Changing the minimum number of words to zero, so that we do not miss matches of the keyword that are not surrounded by any words:
egrep -no '(\w+ ){0,3}keyword( \w+){0,2}' file
Example:
egrep -no '(\w+ ){0,3}abc( \w+){0,2}' test.txt
Output:
1:3 4 5 abc 1 2
2:abc
3:1 2 abc
4:abc 1
5:1 abc 1
6:1 2 3 abc 1 2
7:2 3 4 abc 1

I believe you're looking for something like:
egrep -no '(\w+ ){1,3}keyword( \w+){1,2}' file
Because of -o, this prints only the matched part of each line containing the word 'keyword', prefixed with the line number (-n). The match includes up to three words before the keyword and up to two words after it. Since the minimum counts are 1, a keyword with no word before it or no word after it is not matched.
\w matches any single character classified as a "word" character (alphanumeric or _).
This answer also assumes words are separated by a single space character.
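As a rough sketch of the same idea, the context sizes can be held in shell variables, and the POSIX class [[:alnum:]_] can stand in for \w on greps that lack it. The sample rows and the variable names before and after are assumptions made for illustration; the " +" in the pattern also tolerates runs of several spaces.

```shell
# Hypothetical sample data: the first three rows of the test file.
tmp=$(mktemp)
printf '%s\n' '1 2 3 4 5 abc 1 2 3 4' 'abc' '1 2 abc' > "$tmp"

before=3   # assumed: maximum number of words shown before the match
after=2    # assumed: maximum number of words shown after the match

# -n prefixes the line number, -o prints only the matched part.
out=$(grep -E -no "([[:alnum:]_]+ +){0,$before}abc( +[[:alnum:]_]+){0,$after}" "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Dropping -n from the options removes the line-number prefix.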

Related

How to remove some words in specific field using awk?

I have several lines of text. I want to extract the number after a specific word using awk.
I tried the following code but it does not work.
First, create the test file with: vi test.text. There are 3 columns (the 3 fields are generated by some other pipeline commands using awk).
Index  AllocTres                             CPUTotal
1      cpu=1,mem=256G                        18
2      cpu=2,mem=1024M                       16
3                                            4
4      cpu=12,gres/gpu=3                     12
5                                            8
6                                            9
7      cpu=13,gres/gpu=4,gres/gpu:ret6000=2  20
8      mem=12G,gres/gpu=3,gres/gpu:1080ti=1  21
Please note there are several empty fields in this file.
What I want to achieve is to keep only the number following the first gres/gpu= part and remove all cpu= and mem= parts, using a pipeline like cat test.text | awk '{some_commands}', to output 3 columns:
Index  AllocTres                             CPUTotal
1                                            18
2                                            16
3                                            4
4      3                                     12
5                                            8
6                                            9
7      4                                     20
8      3                                     21
1st solution: With your shown samples, please try the following GNU awk code. It preserves the spacing between fields.
awk '
FNR==1{ print; next }
match($0,/[[:space:]]+/){
  space=substr($0,RSTART,RLENGTH-1)
}
{
  match($2,/gres\/gpu=([0-9]+)/,arr)
  match($0,/^[^[:space:]]+[[:space:]]+[^[:space:]]+([[:space:]]+)/,arr1)
  space1=sprintf("%"length($2)-length(arr[1])"s",OFS)
  if(NF>2){ sub(OFS,"",arr1[1]);$2=space arr[1] space1 arr1[1] }
}
1
' Input_file
Output will be as follows for above code with shown samples:
Index AllocTres CPUTotal
1 18
2 16
3 4
4 3 12
5 8
6 9
7 4 20
8 3 21
2nd solution: If you don't care about preserving the spacing, try the following awk code. The three-argument form of match() used here is a GNU awk extension.
awk 'FNR==1{print;next} match($2,/gres\/gpu=([0-9]+)/,arr){$2=arr[1]} 1' Input_file
Explanation: Adding detailed explanation for above code.
awk '                                  ##Starting awk program from here.
FNR==1{                                ##Checking condition if this is first line then do following.
  print                                ##Printing current line.
  next                                 ##next will skip all further statements from here.
}
match($2,/gres\/gpu=([0-9]+)/,arr){    ##using match function to match regex gres/gpu= digits and keeping digits in capturing group.
  $2=arr[1]                            ##Assigning 1st value of array arr to 2nd field itself.
}
1                                      ##printing current edited/non-edited line here.
' Input_file                           ##Mentioning Input_file name here.
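For awks without the three-argument match(), a minimal sketch of the same extraction can use the POSIX RSTART and RLENGTH variables instead. Two things here are assumptions rather than part of the answer above: rows are taken to carry an AllocTres value only when they have 3 fields, and that field is blanked when it holds no gres/gpu= entry. The sample rows are a shortened copy of the question's data.

```shell
tmp=$(mktemp)
printf '%s\n' 'Index AllocTres CPUTotal' \
    '1 cpu=1,mem=256G 18' \
    '3 4' \
    '4 cpu=12,gres/gpu=3 12' \
    '7 cpu=13,gres/gpu=4,gres/gpu:ret6000=2 20' > "$tmp"

out=$(awk '
FNR == 1 { print; next }            # pass the header through unchanged
NF == 3 {                           # assumed: only 3-field rows carry AllocTres
    if (match($2, /gres\/gpu=[0-9]+/))
        # skip the 9 characters of "gres/gpu=" and keep only the digits
        $2 = substr($2, RSTART + 9, RLENGTH - 9)
    else
        $2 = ""                     # no gpu entry: blank the field
}
{ print }
' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Like the second solution, this does not preserve the original column spacing, because assigning to $2 rebuilds the line with single-space separators.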
Using sed
$ sed 's~\( \+\)[^,]*,\(gres/gpu=\([0-9]\)\|[^ ]*\)[^ ]* \+~\1\3 \t\t\t\t ~' input_file
Index AllocTres CPUTotal
1 18
2 16
3 4
4 3 12
5 8
6 9
7 4 20
8 3 21
awk '
FNR>1 && NF==3 {
    n = split($2, a, ",")
    for (i=1; a[i] !~ /gres\/gpu=[0-9]+,?/ && i<=n; ++i);
    sub(/.*=/, "", a[i])
    $2 = a[i]
}
NF==2 {$3=$2; $2=""}
{printf "%-7s%-11s%s\n",$1,$2,$3}' test.txt
Output:
Index  AllocTres  CPUTotal
1                 18
2                 16
3                 4
4      3          12
5                 8
6                 9
7      4          20
8      3          21
You can adjust column widths as desired.
This assumes the first and last columns always have a value, so that NF (the number of fields) can be used to tell whether field 2 is present. If field 2 is not empty, the script splits it on commas, scans the resulting array for the first element matching gres/gpu=, strips the gres/gpu= prefix, and prints the three fields. If field 2 is empty, the second-to-last line inserts an empty awk field so that printf always works.
If the assumption above is wrong, it's also possible to identify field 2 by its character index.
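A minimal sketch of that character-index approach follows. The layout is an assumption made for illustration: Index occupies characters 1 to 7 and AllocTres the next 40, with start and width as hypothetical variable names. Real data would need its own offsets.

```shell
tmp=$(mktemp)
# Hypothetical fixed-width layout: Index in columns 1-7, AllocTres in 8-47.
printf '%-7s%-40s%s\n' \
    Index AllocTres CPUTotal \
    1 'cpu=1,mem=256G' 18 \
    3 '' 4 \
    4 'cpu=12,gres/gpu=3' 12 > "$tmp"

out=$(awk -v start=8 -v width=40 '
FNR == 1 { print; next }
{
    field = substr($0, start, width)        # AllocTres by character position
    gpu = ""
    if (match(field, /gres\/gpu=[0-9]+/))
        gpu = substr(field, RSTART + 9, RLENGTH - 9)
    fmt = "%s%-" width "s%s\n"              # rebuild the line at the same width
    printf fmt, substr($0, 1, start - 1), gpu, substr($0, start + width)
}' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Because the fields are cut by position rather than by whitespace, an empty first or last column no longer shifts the field numbering.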
An awk-based solution that needs no
- array splitting,
- regex back-referencing,
- prior state tracking, or
- multiple passes over the input (multi-passing /dev/stdin would itself require state tracking)
{mng}awk '!_~NF || sub("[^ ]+$", sprintf("%*s&", length-length($!(NF=NF)),_))' \
FS='[ ][^ \\/]*gres[/]gpu[=]|[,: ][^= ]+[=][^,: ]+' OFS=
Index AllocTres CPUTotal
1 18
2 16
3 4
4 3 12
5 8
6 9
7 4 20
8 3 21
If you don't need nawk support, there is an even simpler single-pass approach with only one all-encompassing call to sub() per line:
awk ' sub("[^ ]*$", sprintf("%*s&", length($_) - length($(\
gsub(" [^ /]*gres[/]gpu=|[,: ][^= ]+=[^,: ]+", _)*_)),_))'
or even more condensed, at the cost of worse syntax styling:
awk 'sub("[^ ]*$",sprintf("%*s&",length^gsub(" [^ /]*gres\/gpu=|"\
"[,: ][^= ]+=[^,: ]+",_)^_ - length,_) )'
This might work for you (GNU sed):
sed -E '/=/!b
s/\S+/\n&\n/2;h
s/.*\n(.*)\n.*/\1/
/gpu=/!{s/./ /g;G;s/(^.*)\n(.*)\n.*\n/\2\1/p;d}
s/gpu=([^,]*)/\n\1 \n/;s/(.*)\n(.*\n)/\2\1/;H
s/.*\n//;s/./ /g;H;g
s/\n.*\n(.*)\n(.*)\n.*\n(.*)/\2\3\1/' file
In essence, the solution above uses the hold space as a scratchpad for intermediate results. Those results are gathered by isolating the second field and then the gpu info within it. The step-by-step story follows:
If the line does not contain a second field, leave it alone.
Surround the second field by newlines and make a copy.
Isolate the second field.
If the second field contains no gpu info, replace the entire field by spaces and, using the copy, format the line accordingly.
Otherwise, isolate the gpu info, move it to the front of the line and append that to the copy of the line in the hold space.
Meanwhile, remove the gpu info from the pattern space and replace each character in the pattern space by a space.
Append these spaces to the copy and then overwrite the pattern space by the copy.
Lastly, knowing each part of the line has been split by newlines, reassemble the parts into the desired format.
N.B. The solution depends on the spacing of columns being real spaces. If there are tabs in the file, then prepend the sed command s/\t/        /g (where in the example tabs are replaced by 8 spaces).
Alternative:
sed -E '/=/!b
s/\S+/\n&\n/2;h
s/.*(\n.*)\n.*/\1/;s/(.)(.*gpu=)([^,]+)/\3\1\2/;H
s/.*\n//;s/./ /g;G
s/(.*)\n(.*)\n.*\n(.*)\n(.*)\n.*$/\2\4\1\3/' file
In this solution, rather than treating lines with a second field but no gpu info as a separate case, I introduce a placeholder for the missing info and follow the same steps as if the gpu info were present.

How to replace a number to another number in a specific column using awk

This is probably basic but I am completely new to command-line and using awk.
I have a file like this:
1 RQ22067-0 -9
2 RQ34365-4 1
3 RQ34616-4 1
4 RQ34720-1 0
5 RQ14799-8 0
6 RQ14754-1 0
7 RQ22101-7 0
8 RQ22073-1 0
9 RQ30201-1 0
I want the 0s in column 3 to change to 1, and any occurrence of 1 or 2 in column 3 to change to 2. So essentially I am only changing numbers in column 3, and I am not changing the -9.
1 RQ22067-0 -9
2 RQ34365-4 2
3 RQ34616-4 2
4 RQ34720-1 1
5 RQ14799-8 1
6 RQ14754-1 1
7 RQ22101-7 1
8 RQ22073-1 1
9 RQ30201-1 1
I have tried the commands below, but they have not worked:
>> awk '{gsub("0","1",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
>> awk '{gsub("1","2",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
Thank you.
With this code in your question:
awk '{gsub("0","1",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
awk '{gsub("1","2",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
- you're running both commands on the same input file and writing their output to the same output file, so only the output of the 2nd script will be present in the output, and
- you're trying to change 0 to 1 first and THEN change 1 to 2, so the $3s that start out as 0 would end up as 2; you need to change the order of the operations.
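The ordering problem can be seen on a tiny input. This is only an illustrative sketch; the two-row sample and the variable names wrong and right are made up for the demonstration.

```shell
# Hypothetical two-row sample: one value starts as 0, the other as 1.
sample=$(printf '%s\n' 'a 0' 'b 1')

# 0 -> 1 first, then 1 -> 2: the row that started as 0 is changed twice.
wrong=$(printf '%s\n' "$sample" | awk '{ sub(/0/, "1", $2); sub(/1/, "2", $2) } 1')

# 1 -> 2 first, then 0 -> 1: each row is changed exactly once.
right=$(printf '%s\n' "$sample" | awk '{ sub(/1/, "2", $2); sub(/0/, "1", $2) } 1')

printf 'wrong order:\n%s\nright order:\n%s\n' "$wrong" "$right"
```

In the first ordering both rows end up as 2; in the second, the rows end up as 1 and 2 respectively.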
This is what you should be doing, using your existing code:
awk '{gsub("1","2",$3); gsub("0","1",$3)}1' PRS_with_minus9.pheno.txt > PRS_with_minus9_modified.pheno
For example:
$ awk '{gsub("1","2",$3); gsub("0","1",$3)}1' file
1 RQ22067-0 -9
2 RQ34365-4 2
3 RQ34616-4 2
4 RQ34720-1 1
5 RQ14799-8 1
6 RQ14754-1 1
7 RQ22101-7 1
8 RQ22073-1 1
9 RQ30201-1 1
The gsub()s should also just be sub()s, since you only want to perform each substitution once, and you don't need to enclose the numbers in quotes, so you could just do:
awk '{sub(1,2,$3); sub(0,1,$3)}1' file
You can check the value of column 3 and then update the field value.
Check for 1 as the first rule because if the first check is for 0, the value will be set to 1 and the next check will set the value to 2 resulting in all 2's.
awk '
{
  if($3==1) $3 = 2
  if($3==0) $3 = 1
}
1' file
Output
1 RQ22067-0 -9
2 RQ34365-4 2
3 RQ34616-4 2
4 RQ34720-1 1
5 RQ14799-8 1
6 RQ14754-1 1
7 RQ22101-7 1
8 RQ22073-1 1
9 RQ30201-1 1
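As a small variation on the code above, the order of the checks stops mattering if they are made mutually exclusive with else, because a value that has just been changed is not tested again. The three sample rows below are a shortened copy of the question's data.

```shell
out=$(printf '%s\n' '1 RQ22067-0 -9' '2 RQ34365-4 1' '4 RQ34720-1 0' |
    awk '{
        if ($3 == 0)      $3 = 1    # 0 becomes 1 ...
        else if ($3 == 1) $3 = 2    # ... and is not re-tested against 1
    } 1')
printf '%s\n' "$out"
```

Rows with any other value in column 3, such as -9, pass through untouched.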
With your shown samples, try the following code using ternary operators. In short: if the 3rd field is 1, set it to 2; otherwise, if it is 0, set it to 1; otherwise keep it as it is. Finally, print the line.
awk '{$3=$3==1?2:($3==0?1:$3)} 1' Input_file
Generic solution: this version takes 3 awk variables, each a comma-separated list. fieldNumber lists the field numbers to check, existValue lists the values to match, and newValue lists the replacement for each matched value, in the same order.
awk -v fieldNumber="3" -v existValue="1,0" -v newValue="2,1" '
BEGIN{
  num=split(fieldNumber,arr1,",")
  num1=split(existValue,arr2,",")
  num2=split(newValue,arr3,",")
  for(i=1;i<=num1;i++){
    value[arr2[i]]=arr3[i]
  }
}
{
  for(i=1;i<=num;i++){
    if($arr1[i] in value){
      $arr1[i]=value[$arr1[i]]
    }
  }
}
1
' Input_file
This might work for you (GNU sed):
sed -E 's/\S+/\n&\n/3;h;y/01/12/;G;s/.*\n(.*)\n.*\n(.*)\n.*\n.*/\2\1/' file
Surround 3rd column by newlines.
Make a copy.
Replace all 0's by 1's and all 1's by 2's.
Append the original.
Pattern match on newlines and replace the 3rd column in the original by the 3rd column in the amended line.
Also with awk:
awk 'NR > 1 {s=$3;sub(/1/,"2",s);sub(/0/,"1",s);$3=s} 1' file
1 RQ22067-0 -9
2 RQ34365-4 2
3 RQ34616-4 2
4 RQ34720-1 1
5 RQ14799-8 1
6 RQ14754-1 1
7 RQ22101-7 1
8 RQ22073-1 1
9 RQ30201-1 1
The substitutions are made with sub() on a copy of $3, and then the copy with the changes is assigned to $3.
If you don't like the simple
sed 's/1$/2/; s/0$/1/' file
you might want to play with
sed -E 's/(.*)([01])$/echo "\1$((\2+1))"/e' file

Bash code to structure proteomics data

I need help restructuring my dataset so that I can perform the downstream analysis. I am presently dealing with proteomics data and want to perform a comparative analysis. The problem is the protein ids. In general, one protein can have more than 1 id, and the ids are separated by ";". I need to print the entire line once for each of the protein's ids. For example:
Input file :
tom dick harry jan
a;b;c 1 2 3 4
d;e 4 5 7 3
desirable output:
tom dick harry jan
a 1 2 3 4
b 1 2 3 4
c 1 2 3 4
d 4 5 7 3
e 4 5 7 3
many many thanks in advance
$ awk 'NR==1{$0="key "$0} {split($1,a,/;/); for (i=1; i in a; i++) { $1=a[i]; print } }' file | column -t
key tom dick harry jan
a 1 2 3 4
b 1 2 3 4
c 1 2 3 4
d 4 5 7 3
e 4 5 7 3
You can trivially remove the word "key" from the output if you don't like it but IMHO having some columns with and some without headers is a very bad idea - just makes any further processing more difficult.
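#!/bin/bash
# Reads the table from standard input.
IFS= read -r header
printf "%4s %s\n" "" "$header"
# Loop until read fails at end of input (a bare "while true" would never stop).
while read -r ids values
do
  for id in $(tr ';' ' ' <<< "$ids")
  do
    printf "%-4s %s\n" "$id" "$values"
  done
done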
This reads the header and prints it (just slightly differently formatted). It then reads each remaining line until the input ends and, for each, prints one line per id given at the beginning of that line. To find the ids, the ids string is split on semicolons (;).
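The same splitting can also be sketched without tr or bash-only syntax by temporarily setting IFS to ";". The function name expand_ids is hypothetical, and the header is printed unpadded here, unlike in the script above.

```shell
# Sketch: split the first column on ";" using IFS only (POSIX sh).
expand_ids() {
    IFS= read -r header || return 0
    printf '%s\n' "$header"
    while read -r ids values; do
        old_ifs=$IFS
        IFS=';'
        set -f                      # no filename globbing while splitting
        for id in $ids; do
            printf '%s %s\n' "$id" "$values"
        done
        set +f
        IFS=$old_ifs                # restore before the next read
    done
}

out=$(printf '%s\n' 'tom dick harry jan' 'a;b;c 1 2 3 4' 'd;e 4 5 7 3' | expand_ids)
printf '%s\n' "$out"
```

Restoring IFS inside the loop matters, since the next read relies on the default whitespace splitting to separate the ids from the values.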

Adding new line to file with sed

I want to add a new line to the top of a data file with sed, and write something to that line.
I tried this as suggested in How to add a blank line before the first line in a text file with awk :
sed '1i\
\' ./filename.txt
but it printed a backslash at the beginning of the first line of the file instead of creating a new line. The terminal also throws an error if I try to put it all on the same line ("1i\": extra characters after \ at the end of i command).
Input :
1 2 3 4
1 2 3 4
1 2 3 4
Expected output
14
1 2 3 4
1 2 3 4
1 2 3 4
$ sed '1i\14' file
14
1 2 3 4
1 2 3 4
1 2 3 4
but just use awk for clarity, simplicity, extensibility, robustness, portability, and every other desirable attribute of software:
$ awk 'NR==1{print "14"} {print}' file
14
1 2 3 4
1 2 3 4
1 2 3 4
Basically you are concatenating two files: a file containing one line, and the original file. By its name, this is a task for cat:
cat - file <<< 'new line'
# or
echo 'new line' | cat - file
where - stands for stdin.
You can also use cat together with process substitution if your shell supports this:
cat <(echo 'new line') file
Btw, with sed it should be simply:
sed '1i\new line' file
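The one-line 1i\text form is a GNU sed extension. A sketch of the portable form, which the attempt in the question was close to, puts the text to insert on the line after the backslash. The two-row sample file is made up for the demonstration.

```shell
tmp=$(mktemp)
printf '%s\n' '1 2 3 4' '1 2 3 4' > "$tmp"

# POSIX form: the text to insert goes on the line after "1i\".
out=$(sed '1i\
14' "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

The second line of the sed script is the inserted text itself, so it must not start with another backslash unless one is wanted in the output.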

Missing numbers from two sequences

How do I find the missing numbers from two sequences using a bash script?
For example, I have a file which contains the following data:
1 1
1 2
1 3
1 5
2 1
2 3
2 5
output : missing numbers are
1 4
2 2
2 4
This awk one-liner gives the requested output for the specified input:
$ awk '$2!=l2+1&&$1==l1{for(i=l2+1;i<$2;i++)print l1,i}{l1=$1;l2=$2}' file
1 4
2 2
2 4
A solution using grep:
printf "%s\n" {1..2}" "{1..5} | grep -vf file
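Brace expansion is a bash feature, and grep -f treats each line of the file as a regular expression that may match anywhere in a line (so a pattern such as "1 1" would also match "1 10"). A hedged variant generates the candidate pairs with awk and matches them as fixed, whole-line strings. The ranges 1 to 2 and 1 to 5 are assumptions taken from the sample data.

```shell
tmp=$(mktemp)
printf '%s\n' '1 1' '1 2' '1 3' '1 5' '2 1' '2 3' '2 5' > "$tmp"

# Generate every expected pair (assumed ranges: 1-2 and 1-5), then drop the
# pairs present in the file. -F: fixed strings, -x: whole-line matches only.
out=$(awk 'BEGIN { for (i = 1; i <= 2; i++) for (j = 1; j <= 5; j++) print i, j }' |
    grep -vxFf "$tmp")
printf '%s\n' "$out"
rm -f "$tmp"
```

Unlike the awk one-liner, this approach also reports numbers missing at the start or end of a sequence, provided the full ranges are known in advance.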
