Sed - Conditional Matching of pattern - linux

I want to do the following:
Find pattern 1, then find the first instance of pattern 2. After doing so, I want to print the next line. This is for a sed script. I'm pretty lost on how to do this, since sed doesn't have if statements.

This might work for you (GNU sed):
sed -n '/first/,${/second/{n;p;q}}' file
Set -n option to emulate grep i.e. only print what you want. Focus on the range from first to the end of the file ($). Then match second and get the next line (n), print (p) and quit (q).
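As a quick check of that one-liner, here is a minimal sketch that runs it on a small made-up file. The file contents and the demo and result names are assumptions for illustration, and GNU sed is assumed as in the answer. A "second" that appears before "first" is ignored, and only the line after the first "second" inside the range is printed.

```shell
# Build a throwaway sample file (contents are hypothetical).
demo=$(mktemp)
printf '%s\n' 'second (too early)' 'skip' 'first' 'middle' \
              'second' 'wanted' 'second' 'not wanted' > "$demo"

# From the first line matching "first" to end of file: on the first line
# matching "second", read the next line (n), print it (p) and quit (q).
result=$(sed -n '/first/,${/second/{n;p;q}}' "$demo")

printf '%s\n' "$result"   # prints: wanted
rm -f "$demo"
```

Because q stops sed immediately, later occurrences of "second" are never examined.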

If the file j.txt contains the following:
10 20 30
40 50 60
10 90 80
sed -n '/10/p' j.txt | sed -n '/20/,+1p'
The first sed keeps only the lines that match pattern 1 (10). The second sed then looks for pattern 2 (20) in that filtered output and prints the matching line together with the line after it (the ,+1 address form is a GNU extension).
Output:
10 20 30
10 90 80

Related

sed replacing first occurrence of characters in each line of file only if they are first 2 characters

Is it possible using sed to replace the first occurrence of a character or substring in line of file only if it is the first 2 characters in the line?
For example we have this text file:
15 hello
15 h15llo
1 hello
1 h15loo
Using the following command: sed -i 's/15/0/' file.txt
Will give this output
0 hello
0 h15llo
1 hello
1 h0loo
What I am trying to avoid is it considering the characters past the first 2.
Is this possible?
Desired output:
0 hello
0 h15llo
1 hello
1 h15loo
You can use
sed -i 's/^15 /0 /' file.txt
sed -i 's/^15\([[:space:]]\)/0\1/' file.txt
sed -i 's/^15\(\s\)/0\1/' file.txt
Here, ^ matches the start of the line, 15 matches the literal substring 15, and the space matches a space.
The second and third commands do the same thing, but instead of a literal space they capture a whitespace character into Group 1 and put it back into the result with the \1 placeholder. Note that \s is a GNU extension, while [[:space:]] is POSIX.
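To see the effect of the anchor without touching a real file, here is a small sketch that runs both substitutions on the sample lines from the question. The variable names are made up for illustration, and -i is left out so nothing is modified in place.

```shell
# Sample lines from the question, held in a variable instead of file.txt.
sample=$(printf '%s\n' '15 hello' '15 h15llo' '1 hello' '1 h15loo')

# Unanchored: replaces the first 15 found anywhere on each line.
unanchored=$(printf '%s\n' "$sample" | sed 's/15/0/')

# Anchored: replaces 15 only at the start of the line, followed by a space.
anchored=$(printf '%s\n' "$sample" | sed 's/^15 /0 /')

printf '%s\n' "$anchored"
```

The last line stays "1 h15loo" in the anchored version, while the unanchored version turns it into "1 h0loo".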

vi sed or awk. every line in a text file. replace 9 characters starting at position 75

I have a huge file.
On every line from line 3 to the second-to-last line (number of lines in the file minus 1),
I need to replace the 9 characters starting at character position 75 with 123456789.
Any thoughts or suggestions? The existing characters are different on every line, so I can't search for them.
The joys of hiding PII data.
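In vim, you can do this:
:3,$-1s/\(^.\{74\}\)\@<=.\{9\}/123456789/
This uses a lookbehind (\@<=) that requires exactly 74 characters between the start of the line and the match, so the nine characters matched by .\{9\} begin at position 75 and are replaced with your string. The range 3,$-1 limits the change to line 3 through the second-to-last line.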
Let's consider this test file:
$ cat testfile
.........-.........-.........-.........-.........-.........-.........-....ReplaceMeKeep
.........-.........-.........-.........-.........-.........-.........-....OldData..Keep
Using sed
This replaces the nine characters starting at column 75 with 123456789:
$ sed -E 's/(.{74}).{0,9}/\1123456789/' testfile
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
Using awk
This puts the new string in place of the first nine characters starting at position 75:
$ awk '{print substr($0,1,74) "123456789" substr($0,75+9)}' testfile
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
.........-.........-.........-.........-.........-.........-.........-....123456789Keep
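The question also asked to change only line 3 through the second-to-last line, which neither command above restricts. Here is a minimal sketch of one way to add that range to the sed version, assuming a sed that supports -E (GNU or BSD). The sample lines and the pad, input, and masked names are made up for illustration, and the pattern is tightened to exactly nine characters anchored at the start of the line.

```shell
# 74 filler characters, so the interesting text starts at position 75.
pad=$(printf '%74s' '' | tr ' ' '.')

# Five hypothetical lines: two header lines, two data lines, one footer.
input=$(printf '%s\n' "${pad}Header001A" "${pad}Header002B" \
                      "${pad}Secret003C" "${pad}Secret004D" \
                      "${pad}Footer005E")

# 3,$  selects line 3 through the last line;
# $!   then excludes the last line from the substitution.
masked=$(printf '%s\n' "$input" |
  sed -E '3,${$!s/^(.{74}).{9}/\1123456789/;}')

printf '%s\n' "$masked"
```

Only lines 3 and 4 are changed here; lines 1, 2, and the final line pass through untouched.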

grep to search data in first column

I have a text file with two columns.
Product Cost
Abc....def 10
Abc.def 20
ajsk,,lll 04
I want to find the products that start with "Abc" and end with "def", and then add up the Cost for those entries.
I have used:
grep "^Abc|def$" myfile
but it is not working.
Use awk. cat myfile | awk '{print $1}' | grep query
If you can use awk, try this:
test.txt
--------
Product Cost
Abc....def 10
Abc.def 20
ajsk,,lll 04
With only awk:
awk '$1 ~ /^Abc.*def$/ { SUM += $2 } END { print SUM } ' test.txt
Result: 30
With grep and awk:
grep '^Abc.*def [0-9]*$' test.txt | awk '{SUM += $2} END {print SUM}'
Result: 30
Explanation:
awk reads each line and matches the first column against a regular expression (regex).
The first column has to start with Abc, followed by any characters (zero or more), and end with def.
If a line matches, its second column is added to the SUM variable.
After all lines have been read, the variable is printed.
grep extracts each line that starts with Abc, followed by any characters, followed by def, a space, and digits up to the end of the line. Those lines are piped to awk, which adds the second column of each line it receives to SUM and prints SUM after the last line.
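The awk-only approach can be tried without creating test.txt. This is a small sketch under a few assumptions: the data is held in a hypothetical data variable, an extra non-matching row (Abcxyz) is added for illustration, and sum + 0 is used so that 0 is printed when nothing matches.

```shell
# Rows from the question plus one made-up row that must not be counted.
data=$(printf '%s\n' 'Product Cost' 'Abc....def 10' 'Abc.def 20' \
                     'ajsk,,lll 04' 'Abcxyz 99')

# Add column 2 only when column 1 starts with Abc and ends with def.
total=$(printf '%s\n' "$data" |
  awk '$1 ~ /^Abc.*def$/ { sum += $2 } END { print sum + 0 }')

printf '%s\n' "$total"   # prints: 30
```

The header line needs no special handling because its first column does not match the pattern.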
Thanks, edited. Do you want the command like this?
grep "^Abc.*def *.*$"
If you don't want to use cat, and also show the line numbers:
awk '{print $1}' filename | grep -n keyword
If applicable, consider the caret ^: grep -E '^foo|^bar' matches lines that begin with foo or bar. Column one is always located at the beginning of the line.
Regular expression > POSIX basic and extended
^ Matches the starting position within the string. In line-based tools, it matches the starting position of any line.

How can I remove double line breaks with sed?

I tried:
sed -i 's/\n+/\n/' file
but it's not working.
I still want single line breaks.
Input:
abc

def


ghi



jkl
Desired output:
abc

def

ghi

jkl
This might work for you (GNU sed):
sed '/^$/{:a;N;s/\n$//;ta}' file
This replaces multiple blank lines by a single blank line.
However if you want to place a blank line after each non-blank line then:
sed '/^$/d;G' file
Which deletes all blank lines and only appends a single blank line to a non-blank line.
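Both variants can be compared on a small sample. This is a sketch assuming GNU sed; the input (runs of one, two, and three blank lines) and the variable names are made up for illustration, and the extra ; before the closing brace is only a defensive choice.

```shell
# Sample text with runs of 1, 2 and 3 blank lines between entries.
input=$(printf 'abc\n\ndef\n\n\nghi\n\n\n\njkl\n')

# Variant 1: collapse each run of blank lines into one blank line.
squeezed=$(printf '%s\n' "$input" | sed '/^$/{:a;N;s/\n$//;ta;}')

# Variant 2: delete every blank line, then append one after each line.
# (This also adds a blank line after the last line, which $(...) strips.)
respaced=$(printf '%s\n' "$input" | sed '/^$/d;G')

printf '%s\n' "$squeezed"
```

The two differ on input that has no blank lines at all: variant 1 leaves it unchanged, while variant 2 double-spaces it.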
Sed isn't very good at tasks that examine multiple lines programmatically. Here is the closest I could get:
$ sed '/^$/{n;/^$/d}' file
abc

def

ghi


jkl
The logic of this: if you find a blank line, look at the next line. If that next line is also blank, delete that next line.
This doesn't gobble up every extra blank line in a longer run. After a pair of blank lines has been reduced to one, the script starts over at the following line, so a run of three blank lines still comes out as two.
To do it in basic awk:
$ awk 'NF > 0 {blank=0} NF == 0 {blank++} blank < 2' file
abc

def

ghi

jkl
This uses a variable called blank, which is zero when the number of fields (NF) is nonzero and increments when they are zero (a blank line). Awk's default action, printing, is performed when the number of consecutive blank lines is less than two.
Using awk (gnu or BSD) you can do:
awk -v RS= -v ORS='\n\n' '1' file
abc

def

ghi

jkl
Also using perl:
perl -pe '$/=""; s/(\n)+/$1$1/' file
abc

def

ghi

jkl
Found at That's What I Sed (slower than the /^$/{:a;N;s/\n$//;ta} solution above):
sed '/^$/N;/\n$/D' file
The sed script can be read as follows:
If the next line is empty, delete the current line.
And can be translated into the following pseudo-code (for the reader already familiar with sed, buffer refers to the pattern space):
 1 | # sed '/^$/N;/\n$/D' file
 2 | while not end of file :
 3 |   buffer = next line
 4 |   # /^$/N
 5 |   if buffer is empty :                        # /^$/
 6 |     buffer += "\n" + next line                # N
 7 |   end if
 8 |   # /\n$/D
 9 |   if buffer ends with "\n" :                  # /\n$/
10 |     delete first line in buffer and go to 5   # D
11 |   end if
12 |   print buffer
13 | end while
In the regular expression /^$/, the ^ and $ signs mean "beginning of the buffer" and "end of the buffer" respectively. They refer to the edges of the buffer, not to the content of the buffer.
The D command performs the following tasks: if the buffer contains newlines, delete the text of the buffer up to the first newline, and restart the program cycle (go back to line 1) without processing the rest of the commands, without printing the buffer, and without reading a new line of input.
Finally, keep in mind that sed removes the trailing newline before processing the line, and that the print command adds the trailing newline back. So, in the above code, if the next line to be processed is Hello World!\n, then next line implicitly refers to Hello World!.
More details at https://www.gnu.org/software/sed/manual/sed.html.
You are now ready to apply the algorithm to the following file:
a\n
b\n
\n
\n
\n
c\n
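To check your hand trace of the algorithm, here is a small sketch that feeds exactly that file to the script. The input and result variable names are made up for illustration.

```shell
# The six-line file from above: a, b, three empty lines, c.
input=$(printf 'a\nb\n\n\n\nc\n')

# N joins an empty line with the line after it; D drops the first of the
# two when the joined buffer still ends in a newline (next line is empty).
result=$(printf '%s\n' "$input" | sed '/^$/N;/\n$/D')

printf '%s\n' "$result"
```

The three empty lines collapse into one, so the output is a, b, one empty line, then c.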
Now let's see why the /^$/{:a;N;s/\n$//;ta} solution is faster.
The sed script /^$/{:a;N;s/\n$//;ta} can be read as follows:
If the current line matches /^$/, then do {:a;N;s/\n$//;ta}.
Since there is nothing between ^ and $ we can rephrase like this:
If the current line is empty, then do {:a;N;s/\n$//;ta}.
It means that sed executes the following commands for each empty line:
Step  Command   Description
1     :a        Declare a label named "a".
2     N         Append the next line, preceded by a newline (\n), to the current line.
3     s/\n$//   Substitute (s) any trailing newline (/\n$/) with nothing (//).
4     ta        Return to label "a" (to step 1) if a substitution was performed (at step 3); otherwise print the result and move on to the next line.
Non empty lines are just printed as is. Knowing all this, we can describe the entire procedure with the following pseudo-code:
 1 | # sed '/^$/{:a;N;s/\n$//;ta}' file
 2 | while not end of file :
 3 |   buffer = next line
 4 |   # /^$/{:a;N;s/\n$//;ta}
 5 |   if buffer is empty :                      # /^$/
 6 |     :a                                      # :a
 7 |     buffer += "\n" + next line              # N
 8 |     if buffer ends with "\n" :              # /\n$/
 9 |       remove last "\n" from buffer          # s/\n$//
10 |       go to :a (at 6)                       # ta
11 |     end if
12 |   end if
13 |   print buffer
14 | end while
As you can see, the two sed scripts are very similar. Indeed, s/\n$//;ta is almost the same as /\n$/D. However, the second script jumps back to line 6 and so skips the test at line 5, which makes it potentially faster than the first script. Let's time both scripts on about 10 MB of empty lines:
$ yes '' | head -10000000 > file
$ /usr/bin/time -f%U sed '/^$/N;/\n$/D' file > /dev/null
3.61
$ /usr/bin/time -f%U sed '/^$/{:a;N;s/\n$//;ta}' file > /dev/null
2.37
Second script wins.
perl -00 -pe 1 filename
That splits the input file into "paragraphs" separated by 2 or more newlines, and then prints the paragraphs separated by a single blank line:
perl -00 -pe 1 <<END
abc

def


ghi



jkl
END
abc

def

ghi

jkl
This gives you what you want using solely sed :
sed '/^$/d' txt | sed -e $'s/$/\\\n/'
The first sed command removes all empty lines, denoted as "^$".
The second sed command inserts one newline character at the end of each line.
Why not just get rid of all your blank lines, then add a single blank line after each line? For an input file tmp as you specified,
sed '/^$/d' tmp|sed '0~1 a\ '
abc

def

ghi

jkl
If white space (spaces and tabs) counts as a "blank" line for you, then use sed '/^\s*$/d' tmp|sed '0~1 a\ ' instead.
Note that these solutions do leave a trailing blank line at the end, as I wasn't sure if this was desired. Easily removed.
I wouldn't use sed for this but cat with the -s flag.
As the manual states:
-s, --squeeze-blank suppress repeated empty output lines
So all that is needed to get the desired output is:
cat -s file
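A small sketch of the same idea on inline data, assuming a cat that supports -s (both GNU and BSD cat do, although the flag is not in POSIX). The sample input and variable names are made up for illustration.

```shell
# Sample text with runs of 1, 2 and 3 blank lines between entries.
input=$(printf 'abc\n\ndef\n\n\nghi\n\n\n\njkl\n')

# -s squeezes each run of empty lines down to a single empty line.
squeezed=$(printf '%s\n' "$input" | cat -s)

printf '%s\n' "$squeezed"
```

Note that -s only squeezes lines that are completely empty; lines containing spaces or tabs are left alone.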

Extract certain text from each line of text file using UNIX or perl

I have a text file with lines like this:
Sequences (1:4) Aligned. Score: 4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...
I want to be able to extract the values into a new tab delimited text file:
1 4 4
100 3011 77
12 345 100
(like this but with tabs instead of spaces)
Can anyone suggest anything? Some combination of sed or cut maybe?
You can use Perl:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'
Or, to save to file:
cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt
Little explanation:
Regex here is in the form:
s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/
How to match = .*?(\d+):(\d+).*?(\d+)
How to replace = $1\t$2\t$3
In our case, we used the following tokens to declare how we want to match the string:
.*? - match any character ('.') as few times as possible ('*?'), stopping as soon as the next token in the regex (which is \d in our case) can match.
\d+:\d+ - match at least one digit, followed by a colon and at least one more digit
.*? - same as above
\d+ - match at least one digit
Additionally, if a token in the regex is in parentheses, it means "save it so I can reference it later". The first pair of parentheses is known as '$1', the second as '$2', and so on. In our case:
.*?(\d+):(\d+).*?(\d+)
    $1    $2      $3
Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):
$1\t$2\t$3
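The same capture-group idea can be sketched with sed instead of Perl. This is only an illustration, assuming a sed that supports -E (GNU or BSD). The tab, lines, and table names are made up, and a literal tab is passed in through a shell variable because \t in the replacement is not portable.

```shell
# A literal tab character, since not every sed understands \t.
tab=$(printf '\t')

# The three sample lines from the question.
lines=$(printf '%s\n' 'Sequences (1:4) Aligned. Score: 4' \
                      'Sequences (100:3011) Aligned. Score: 77' \
                      'Sequences (12:345) Aligned. Score: 100')

# Groups 1 and 2 are the numbers around the colon, group 3 is the score.
# [^0-9]* plays the role of the non-greedy .*? in the Perl version.
table=$(printf '%s\n' "$lines" |
  sed -E "s/[^0-9]*([0-9]+):([0-9]+)[^0-9]*([0-9]+).*/\1${tab}\2${tab}\3/")

printf '%s\n' "$table"
```

Redirecting the final printf to a file gives the tab-delimited output the question asks for.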
You could use sed:
sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile
Here's a BSD sed compatible version:
sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile
The above solutions leave a trailing tab in the output. To remove it, append s/\t$// or s/'$'\t''$// respectively as an additional command.
If you know there will always be 3 numbers per line, you could go with grep:
<infile grep -o '[0-9]\+' | paste - - -
Output in all cases:
1 4 4
100 3011 77
12 345 100
My solution using sed:
sed 's/[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)/\1 \2 \3/' file.txt
