Linux Shell - Grep command - linux

I'm having a problem using grep with these interval expressions: \{n\}, \{n,\} and \{n,m\}. I have a file named "new" with these lines:
aaaa
aaa
aa
a
When I use grep 'a\{1\}' new I get this output:
aaaa
aaa
aa
a
So, basically, this command will show me the lines that include 1, or more, consecutive occurrences of the character "a" right?
Also, will grep 'a\{1,\}' new do the same thing as grep 'a\{1\}' new? I get the same output for both.
The last one, \{n,m\}, I can't really understand what it does.
I would really appreciate if anyone could help me out.

From man grep:
Repetition
A regular expression may be followed by one of several repetition
operators:
? The preceding item is optional and matched at most once.
* The preceding item will be matched zero or more times.
+ The preceding item will be matched one or more times.
{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{n,m} The preceding item is matched at least n times, but not more
than m times.
The example grep 'a\{2,3\}' new also matches the line aaaa, because its first three (or two) a's already satisfy the pattern. The rest of the line is not considered.
If you really want only runs of 2 or 3 consecutive a's to be shown, you could use the -o flag, which prints just the matched text. But be aware that this would output aaa and aa from a line containing aaaaa. To avoid this you have to add more context to the pattern, such as the line anchors ^ and $ in the example below.
Btw. I would suggest using the -E flag (or egrep, which is equivalent but marked obsolescent in recent GNU grep) so you have extended regex support. You don't have to escape the braces then.
For input
aaaaa
aaaa
aaa
aa
a
a call of grep -o -E '^a{2,3}$' will give the output:
aaa
aa
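To see the effect of the anchors, here is a minimal sketch that recreates the sample input and runs the pattern with and without ^ and $. The temporary file and the variable names are illustrative choices, not part of the original question.

```shell
# Recreate the sample input in a temporary file.
tmp=$(mktemp)
printf '%s\n' aaaaa aaaa aaa aa a > "$tmp"

# Unanchored: a line is selected if it contains 2 or 3 a's anywhere,
# so every line with at least two a's is printed.
unanchored=$(grep -E 'a{2,3}' "$tmp")

# Anchored with ^ and $: the whole line must be exactly 2 or 3 a's.
anchored=$(grep -E '^a{2,3}$' "$tmp")

printf 'unanchored:\n%s\n' "$unanchored"
printf 'anchored:\n%s\n' "$anchored"
rm -f "$tmp"
```

The unanchored run should list four lines (everything except the single a), while the anchored run should list only aaa and aa.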

grep 'a\{n,m\}' new looks in each line of new for a run of at least n and at most m consecutive a's, and prints every line where such a run is found.
For example, grep 'a\{2,3\}' new will output
aaaa
aaa
aa
The last line (a) does not match because it has only ONE a.
For 'a\{n,\}', omitting m means any number of repetitions greater than or equal to n.
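Since 'a\{1\}' and 'a\{1,\}' select the same lines, the difference only becomes visible when you look at what each one actually matches. The sketch below uses -o (a GNU/BSD grep option that prints each match on its own line) on the illustrative input aaa; the variable names are made up for this example.

```shell
# a\{1\} matches exactly one a, so "aaa" yields three separate matches.
exact_count=$(printf 'aaa\n' | grep -o 'a\{1\}' | wc -l | tr -d ' ')

# a\{1,\} matches one or more a's and is greedy, so "aaa" is one match.
open_match=$(printf 'aaa\n' | grep -o 'a\{1,\}')

printf 'exact interval: %s matches\n' "$exact_count"
printf 'open interval matched: %s\n' "$open_match"
```

Without -o both commands print the whole line aaa once, which is why the two outputs in the question look identical.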

Related

Bash script: filter columns based on a character

My text file should be of two columns separated by a tab-space (represented by \t) as shown below. However, there are a few corrupted values where column 1 has two values separated by a space (represented by \s).
A\t1
B\t2
C\sx\t3
D\t4
E\sy\t5
My objective is to create a table as follows:
A\t1
B\t2
C\t3
D\t4
E\t5
That is, I want to discard the second value that appears after the space in column 1. For example, in C\sx\t3 I can discard the x after the space and store the columns as C\t3.
I have tried a couple of things but with no luck.
I tried to cut the cols based on \t into independent columns and then cut the first column based on \s and join them again. However, it did not work.
Here is the snippet:
col1=$(cut -d$'\t' -f1 $file | cut -d' ' -f1)
col2=$(cut -d$'\t' -f2 $file)
myArr=()
for((idx=0;idx<${#col1[@]};idx++));do
echo "${col1[$idx]} ${col2[$idx]}"
# I will append to myArr here
done
The output appends the whole list of col2 to col1, as A B C D E 1 2 3 4 5. On top of this, my file is huge (5,300,000 rows), so I would like to avoid looping over all the records and appending them one by one.
Any advice is very much appreciated.
Thank you. :)
And another sed solution:
Search and replace any literal space followed by any number of non-TAB-characters with nothing.
sed -E 's/ [^\t]+//' file
A 1
B 2
C 3
D 4
E 5
If there could be more than one actual space in there just make it 's/ +[^\t]+//' ...
Assuming that when you say a space you mean a blank character then using any awk:
awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1' file
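As a quick check of the awk one-liner, the sketch below feeds it the five sample rows with real tabs. The variable names input and cleaned are illustrative only.

```shell
# Build the sample input with real tabs; printf expands \t.
input=$(printf 'A\t1\nB\t2\nC x\t3\nD\t4\nE y\t5')

# Split on tabs only, then delete everything from the first space
# onward in column 1. The trailing 1 prints every (modified) record.
cleaned=$(printf '%s\n' "$input" |
  awk 'BEGIN{FS=OFS="\t"} {sub(/ .*/,"",$1)} 1')

printf '%s\n' "$cleaned"
```

Because awk streams the file line by line, this approach needs no shell loop and should cope with millions of rows.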
Solution using Perl regular expressions (for me they are easier than sed's, and more portable, since there are several versions of sed):
$ cat ls
A 1
B 2
C x 3
D 4
E y 5
$ cat ls |perl -pe 's/^(\S+).*\t(\S+)/$1\t$2/g'
A 1
B 2
C 3
D 4
E 5
This code captures the run of non-whitespace characters at the start of the line and the run of non-whitespace characters after the \t, and drops whatever lies between them.
Try
sed $'s/^\\([^ \t]*\\) [^\t]*/\\1/' file
The ANSI-C Quoting ($'...') feature of Bash is used to make tab characters visible as \t.
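If the script has to run under a shell without $'...' quoting, one possible variant (an assumption on my part, not something from the answer above) is to keep a literal tab in a variable and splice it into the sed expression. The names tab, input and cleaned are illustrative.

```shell
# Hold a literal tab in a variable so the script depends neither on
# Bash's $'...' quoting nor on sed understanding \t.
tab=$(printf '\t')

input=$(printf 'A\t1\nC x\t3\nE y\t5')

# Keep the leading run of non-space, non-tab characters and drop the
# space plus whatever follows it, up to (not including) the tab.
cleaned=$(printf '%s\n' "$input" |
  sed "s/^\([^ ${tab}]*\) [^${tab}]*/\1/")

printf '%s\n' "$cleaned"
```

Lines without a space in the first column do not match the pattern and pass through unchanged.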
take advantage of FS and OFS and let them do all the hard work for you
{m,g}awk NF=NF FS='[ \t].*[ \t]' OFS='\t'
A 1
B 2
C 3
D 4
E 5
if there's a chance of leading edge or trailing edge spaces and tabs, then perhaps
mawk 'NF=gsub("^[ \t]+|[ \t]+$",_)^_+!_' OFS='\t' RS='[\r]?\n'

Search for a pattern in Column in a CSV and replace another pattern in the same line using sed command

I want to check for a pattern (only if the pattern starts with) in second column in a CSV file and if that pattern exists then replace something else in same line.
I wrote the following sed command for the CSV below to change I to N if the pattern 676 exists in the second column. But it also matches 676 in later columns, because ,676 occurs there too (for example in ,67677). Ideally, only the second column should be checked for the prefix 676. All I want is to check whether the second column starts with 676 (not 676 in the middle or at the end of the value, e.g. 46769777) and, if so, change ,I, to ,N,.
sed -i '/,676/ {; s/,I,/,N,/;}' temp.csc
6768880,55999777,S,I,TTTT,I,67677,yy
6768880,676999777,S,I,TTTT,I,67677,yy
6768880,46769777,S,I,TTTT,I,67677,yy
Expected result required
6768880,55999777,S,I,TTTT,I,67677,yy
6768880,676999777,S,N,TTTT,N,67677,yy
6768880,40999777,S,I,TTTT,I,67677,yy
If you are not bound by sed, awk might be a better option for you. Give this a try :
awk -F"," '{match($2,/^676/)&&gsub(",I",",N")}{print}' temp.csc
match() tests whether the second column starts with (^) 676. If it does, gsub replaces every ,I on that line with ,N.
Result:
6768880,55999777,S,I,TTTT,I,67677,yy
6768880,676999777,S,N,TTTT,N,67677,yy
6768880,46769777,S,I,TTTT,I,67677,yy
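A slightly stricter variation, sketched here as an assumption rather than as the answer's own code, tests the field with the ~ operator and replaces only ,I, with both commas, so a value such as ,IX would be left alone. The variable names are illustrative.

```shell
input='6768880,55999777,S,I,TTTT,I,67677,yy
6768880,676999777,S,I,TTTT,I,67677,yy
6768880,46769777,S,I,TTTT,I,67677,yy'

# Only when field 2 starts with 676, rewrite every ,I, in the line.
# The trailing 1 prints each record, changed or not.
result=$(printf '%s\n' "$input" |
  awk -F, '$2 ~ /^676/ { gsub(/,I,/, ",N,") } 1')

printf '%s\n' "$result"
```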
This requires that 676 appear at the beginning of the second column before any changes are made:
$ sed '/^[^,]*,676/ s/,I,/,N,/g' file
6768880,55999777,S,I,TTTT,I,67677,yy
6768880,676999777,S,N,TTTT,N,67677,yy
6768880,46769777,S,I,TTTT,I,67677,yy
Notes:
The regex /^[^,]*,676/ requires that 676 appear after the first appearance of a comma on the line. In more detail:
^ matches the beginning of the line
[^,]* matches the first column
,676 matches the first comma followed by 676
In your desired output, ,I, was replaced with ,N, every time it appeared on the line. To accomplish this, g (meaning global) was added to the substitute command.
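To make the difference between the two sed addresses concrete, the following sketch counts how many of the three sample lines each address pattern selects, using grep -c with the same regular expressions. The variable names loose and strict are illustrative.

```shell
input='6768880,55999777,S,I,TTTT,I,67677,yy
6768880,676999777,S,I,TTTT,I,67677,yy
6768880,46769777,S,I,TTTT,I,67677,yy'

# The address from the question: ,676 anywhere in the line.
# Every sample line contains ,67677, so all of them are selected.
loose=$(printf '%s\n' "$input" | grep -c ',676')

# The anchored address: 676 right after the first comma.
strict=$(printf '%s\n' "$input" | grep -c '^[^,]*,676')

printf 'loose address selects %s lines, strict selects %s\n' "$loose" "$strict"
```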

Grep find lines that have 4,5,6,7 and 9 in zip code column

I'm using grep to display all lines that have ONLY 4, 5, 6, 7 and 9 in the zipcode column.
How do I display only the lines of the file whose zipcode field consists of the digits 4, 5, 6, 7 and 9?
A sample row is:
15 m jagger mick 41 4th 95115
Thanks
I am going to assume you meant "How do I use grep to..."
If all of the lines in the file have a 5 digit zip at the end of each line, then:
egrep "[45679]{5}$" filename
Should give you what you want.
If there might be whitespace between the zip and the end of the line, then:
egrep "[45679]{5}[[:space:]]*$" filename
would be more robust.
If the problem is more general than that, please describe it more accurately.
The following regex should fetch the desired result, provided the zip code is the last field and is preceded by whitespace:
egrep "(^|[[:space:]])[45679]+$" file
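A pattern for this task needs a boundary on the left of the digit run as well as the $ on the right. The sketch below, using the sample row from the question, shows what happens without one; the variable names are illustrative.

```shell
row='15 m jagger mick 41 4th 95115'

# Without a left boundary, the trailing "5" alone satisfies the
# pattern, so the row is selected even though its zip contains 1.
loose=$(printf '%s\n' "$row" | grep -cE '[45679]+$' || true)

# Requiring whitespace (or the line start) before the run rejects
# the row, because 95115 is not made only of the allowed digits.
strict=$(printf '%s\n' "$row" | grep -cE '(^|[[:space:]])[45679]+$' || true)

printf 'loose: %s match, strict: %s match\n' "$loose" "$strict"
```

The || true is there only because grep exits with status 1 when nothing matches, which would stop a script running under set -e.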
If by "grep" you mean, "the correct tool", then the solution you seek is:
awk '$7 ~ /^[45679]*$/' input
This will print all lines of input in which the 7th field consists only of the characters 4,5,6,7, and 9. If you want to specify 'the last column' rather than the 7th, try
awk '$NF ~ /^[45679]*$/' input
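Here is a small sketch of the $NF variant. The first row is the sample from the question; the second row is invented purely for illustration. It also uses + instead of *, an assumption on my part, so that an empty last field cannot match.

```shell
# Row 1 is from the question; row 2 is a made-up example.
rows='15 m jagger mick 41 4th 95115
16 m richards keith 42 5th 94567'

# Print rows whose last field consists only of the digits 4,5,6,7,9.
matched=$(printf '%s\n' "$rows" | awk '$NF ~ /^[45679]+$/')

printf '%s\n' "$matched"
```

Only the invented second row should be printed, since 95115 contains the digit 1.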

Extract certain text from each line of text file using UNIX or perl

I have a text file with lines like this:
Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...
I want to be able to extract the values into a new tab delimited text file:
1 4 4
100 3011 77
12 345 100
(like this but with tabs instead of spaces)
Can anyone suggest anything? Some combination of sed or cut maybe?
You can use Perl:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'
Or, to save to file:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt
Little explanation:
Regex here is in the form:
s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/
How to match = .*?(\d+):(\d+).*?(\d+)
How to replace = $1\t$2\t$3
In our case, we used the following tokens to declare how we want to match the string:
.*? - match any character ('.') as few times as possible ('*?' is the non-greedy form of '*'), stopping as soon as the next token in the regex (\d in our case) can match.
(\d+):(\d+) - match at least one digit, followed by a colon and at least one more digit
.*? - same as above
\d+ - match at least one digit
Additionally, if some token in the regex is in parentheses, it means "save it so I can reference it later". The first pair of parentheses will be known as '$1', the second as '$2', etc. In our case:
.*?(\d+):(\d+).*?(\d+)
$1 $2 $3
Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):
$1\t$2\t$3
You could use sed:
sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile
Here's a BSD sed compatible version:
sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile
The above solutions leave a trailing tab in the output. To remove it, append s/\t$// or s/'$'\t''$// respectively as a second sed command.
If you know there will always be 3 numbers per line, you could go with grep:
<infile grep -o '[0-9]\+' | paste - - -
Output in all cases:
1 4 4
100 3011 77
12 345 100
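The grep and paste pipeline can be tried on the sample lines as follows. This sketch uses -E with + instead of the GNU-specific \+, and the variable names are illustrative.

```shell
input='Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100'

# grep -o prints every number on its own line; paste - - - then
# joins each group of three lines into one tab-separated row.
table=$(printf '%s\n' "$input" | grep -oE '[0-9]+' | paste - - -)

printf '%s\n' "$table"
```

This relies on every input line containing exactly three numbers; a line with more or fewer would shift all following rows.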
My solution using sed:
sed 's/[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*/\1\t\2\t\3/' file.txt
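Another option, not proposed in the answers above and offered only as a sketch, is to let awk split each line on runs of non-digits. Because every line starts with non-digit text, the first field is empty and the three numbers land in fields 2 to 4. The variable names are illustrative.

```shell
input='Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100'

# Field separator: any run of non-digit characters.
# $1 is the empty text before "Sequences (", so print $2..$4.
table=$(printf '%s\n' "$input" |
  awk -F'[^0-9]+' -v OFS='\t' '{ print $2, $3, $4 }')

printf '%s\n' "$table"
```

Unlike the paste approach, this keeps rows aligned even if some line is malformed, since each line is handled on its own.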

Matching only a <tab> that is between two numbers

How to match a tab only when it is between two numbers?
Sample script
209.65834 27.23204908
119.37987 15.03317082
74.240635 8.30561924
29.1014 0
931.8861 -100.00000
-16.03784 -8.30562
;
_mirror
l
;
29.1014 0
1028.10 0.00
n
_spline
935.4875 250
924.2026913 269.8820375
912.9178825 277.4506484
890.348265 287.3181854
(in the above script, the tabs are between the numbers, not the spaces) (blank lines are significant; there is nothing in them, but I can't lose them)
I wish to get a "," between the numbers. Tried with :%s/\t/\,/ but that will touch the empty lines too, and the end of lines.
Try this:
:%s/\(\d\)\t\(-\?\d\)/\1,\2/
\d matches any digit. -\? means "an optional -". Each pair of (escaped) parentheses captures part of the match; \1 refers to the first captured group and \2 to the second.
google://vim+regex -> http://vimregex.com/ ->
:%s/\([0-9]\)\t\([0-9]\)/\1,\2/gc
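You have 2 groups of digits here ([0-9]) and a tab symbol \t between them. Add some escape symbols and you have the answer. Note that [0-9] on the right-hand side does not match a number that starts with a minus sign.
The g flag replaces every match on a line; the c flag asks for confirmation before each change.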
\1 and \2 are matching groups (numbers in your case).
It's not really hard to find answer for questions like that by yourself.
try
:%s/\([0-9]\)\t\([0-9]\)/\1,\2/g
explanation - search for the pattern <digit>\t<digit> and remember the parts that match <digit>.
\( ... \) captures and remembers the part that matches.
\1 recalls the first captured digit, \2 the second captured digit.
So if the text is 123\t789, <digit>\t<digit> matches 3\t7,
and the 3 and 7 are remembered as \1 and \2.
or
:g/[0-9]/ s/\t/,/g
explanation - filter all lines with a digit, then substitute tabs with a comma on those lines
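The same substitution can be tried outside Vim with sed. This is only a sketch under my own assumptions: the input is a handful of lines taken from the sample, the tab is kept in a variable for portability, and the optional minus sign is written as the POSIX interval -\{0,1\}.

```shell
tab=$(printf '\t')

# A few sample lines: two number pairs, a blank line, a keyword line,
# and a pair whose second number is a bare 0.
input=$(printf '209.65834\t27.23204908\n931.8861\t-100.00000\n\n_mirror\n29.1014\t0')

# Replace a tab with a comma only when it sits between a digit and
# an optionally signed digit; everything else is left untouched.
result=$(printf '%s\n' "$input" |
  sed "s/\([0-9]\)${tab}\(-\{0,1\}[0-9]\)/\1,\2/")

printf '%s\n' "$result"
```

The blank line and the _mirror line pass through unchanged, and the negative number keeps its sign.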

Resources