Understanding the sed N command - linux

The sed manual states about the N command:
N
Add a newline to the pattern space, then append the next line of input to the pattern space. If there is no more input then sed exits without processing any more commands.
Now, from what I know, sed reads each line of input, applies the script(s) to it, and (if -n is not specified) prints it to the output stream.
So, given the following example file:
Apple
Banana
Grapes
Melon
Running this:
$ sed '/Banana/N;p' file
From what I understand, sed should process each line of input: Apple, Banana, Grapes and Melon.
So, I would think that the output will be:
Apple
Apple
Banana
Grapes # since it read the next line with N
Banana
Grapes
Grapes (!)
Grapes (!)
Melon
Melon
Explanation:
Apple is read into the pattern space. It doesn't match the Banana regex, so only p is applied. It's printed twice: once for the p command, and once because sed prints the pattern space by default.
Next, Banana is read into the pattern space. It matches the regex, so the N command is applied: the next line, Grapes, is appended to the pattern space, and then p prints it: Banana\nGrapes. Then the pattern space is printed again due to the default behavior.
Now, I would expect Grapes to be read into the pattern space next, so that Grapes would be printed twice, same as Apple and Melon.
But in reality, this is what I get:
Apple
Apple
Banana
Grapes
Banana
Grapes
Melon
Melon
It seems that once Grapes was read as part of the N command that was applied to Banana, it will no longer be read as a line of its own.
Is that so? And if so, why isn't it emphasized in the docs?

This might explain it (GNU sed):
$ sed '/Banana/N;p' file --d
SED PROGRAM:
/Banana/ N
p
INPUT: 'file' line 1
PATTERN: Apple
COMMAND: /Banana/ N
COMMAND: p
Apple
END-OF-CYCLE:
Apple
INPUT: 'file' line 2
PATTERN: Banana
COMMAND: /Banana/ N
PATTERN: Banana\nGrapes
COMMAND: p
Banana
Grapes
END-OF-CYCLE:
Banana
Grapes
INPUT: 'file' line 4
PATTERN: Melon
COMMAND: /Banana/ N
COMMAND: p
Melon
END-OF-CYCLE:
Melon
Here --d is an abbreviation of --debug (GNU sed accepts unambiguous prefixes of its long options).
You will see the INPUT: lines go 1,2,4 because the second cycle also grabs input line 3 with the N command.

To debug this, I added = to your script so that you can see the current input line number at each point in the cycle; the line numbers conveniently demarcate the output of each iteration. Then, to identify the default print at the end of each cycle, I added s/.*/==&==/ so you can see what sed printed because you did not specify -n.
sed '/Banana/N;=;p;=;s/.*/==&==/' <<\:
> Apple
> Banana
> Grapes
> Melon
> :
1
Apple
1
==Apple==
3
Banana
Grapes
3
==Banana
Grapes==
4
Melon
4
==Melon==
So the pattern space containing Banana and Grapes was emitted twice (once by p, once by the default print, marked with ==), and Apple and Melon were each emitted twice on their own. Note that the = output jumps from 1 to 3: Grapes never starts a cycle of its own, because the N in the Banana cycle already consumed it.
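To see the same consumption effect in isolation, here is a minimal sketch with made-up input (not from the question's data): an unconditional N joins lines in pairs, because each N swallows the following input line before the next cycle starts:
$ printf 'a\nb\nc\nd\n' | sed 'N;s/\n/ /'
a b
c d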


diff 2 files with an output that does not include extra lines

I have 2 files, test and test1, and I would like to diff them without the output including the extra markers 2a3, 4a6, 6a9 shown below.
test:
mangoes
apples
banana
peach
mango
strawberry
test1:
mangoes
apples
blueberries
banana
peach
blackberries
mango
strawberry
star fruit
When I diff the two files:
$ diff test test1
2a3
> blueberries
4a6
> blackberries
6a9
> star fruit
How do I get the output as
$ diff test test1
blueberries
blackberries
star fruit
A solution using comm:
comm -13 <(sort test) <(sort test1)
Explanation
comm - compare two sorted files line by line
With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
As we only need the lines unique to the second file test1, -13 is used to suppress the unwanted columns.
Process Substitution is used to get the sorted files.
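With the sample files above this prints the extra lines, though note that they come out in sorted order rather than in the order they appear in test1:
$ comm -13 <(sort test) <(sort test1)
blackberries
blueberries
star fruit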
You can use grep to keep only the lines that actually differ:
$ diff file1 file2 | grep '^[<>]'
> blueberries
> blackberries
> star fruit
If you want to remove the direction indicators that indicate which file differs, use sed:
$ diff file1 file2 | sed -n 's/^[<>] //p'
blueberries
blackberries
star fruit
(But it may be confusing to not see which file differs...)
You can use awk:
awk 'NR==FNR{a[$0];next} !($0 in a)' test test1
NR==FNR is true while the first file on the command line (i.e. test) is being processed,
a[$0] stores each record as a key in the array named a,
next reads the next line without running the rest of the script,
!($0 in a) prints the current line if it does not exist in a.
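With the sample files this keeps the original order of test1:
$ awk 'NR==FNR{a[$0];next} !($0 in a)' test test1
blueberries
blackberries
star fruit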

Linux command or/and script for duplicate lines retrieval

I would like to know if there is an easy way to locate duplicate lines in a text file that contains many entries (about 200,000 or more) and output a file with the duplicates' line numbers, keeping the source file intact. For instance, I have a file with tweets like this:
1. i got red apple
2. i got red apple in my stomach
3. i got green apple
4. i got red apple
5. i like blue bananas
6. i got red apple
7. i like blues music
8. i like blue bananas
9. i like blue bananas
I want the output to be a separate file like this:
4
6
8
9
where numbers will indicate the lines with duplicate entries (excluding the first occurrence of the duplicates). Also note that the matching pattern must be exactly the same sentence (like line 1 is different than line 2, 5 is different than 7 and so on).
Everything I could find with sort | uniq doesn't seem to match the whole sentence, only the first word, so I'm wondering whether an awk script would be better for this task or whether there is another type of command that can do this.
I also need the first file to be intact (not sorted or reordered in any way) and get only the line numbers as shown above because I want to manually delete these lines from two files. The first file contains the tweets and the second the hashtags of these tweets, so I want to delete the lines that contain duplicate tweets in both files, keeping the first occurrence.
You can try this awk:
awk '$0 in a && a[$0]==1{print NR} {a[$0]++}' file
As per the comment, to report every repeated occurrence rather than only the second one:
awk '$0 in a{print NR} {a[$0]++}' file
Output:
$ awk '$0 in a && a[$0]==1{print NR} {a[$0]++}' file
4
8
$ awk '$0 in a{print NR} {a[$0]++}' file
4
6
8
9
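Since the stated goal is to later drop those lines from both the tweets file and the matching hashtags file, the saved line numbers can be reused. A rough sketch (the names tweets, hashtags and dupes.txt are placeholders, and dupes.txt is assumed to be non-empty):
# collect duplicate line numbers
awk '$0 in a{print NR} {a[$0]++}' tweets > dupes.txt
# drop the listed line numbers from each file, preserving the original order
awk 'NR==FNR{skip[$1]; next} !(FNR in skip)' dupes.txt tweets > tweets.dedup
awk 'NR==FNR{skip[$1]; next} !(FNR in skip)' dupes.txt hashtags > hashtags.dedup
The last two commands read the line numbers into an array and print only those lines of the data file whose line number (FNR) is not listed.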
You could use a Python script to do the same:
f = open("file")
lines = f.readlines()
count = len (lines)
i=0
ignore = []
for i in range(count):
if i in ignore:
continue
for j in range(count):
if (j<= i):
continue
if lines[i] == lines[j]:
ignore.append(j)
print j+1
Output:
4
6
8
9
Here is a method combining a few command line tools:
nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}' | cut -f 1
This
numbers the lines with nl, left adjusted with no leading zeroes (-n ln),
sorts them (ignoring the first field, i.e., the line number) with sort,
finds duplicate lines, ignoring the first field, with uniq; the --all-repeated=prepend option adds an empty line before each group of duplicate lines,
removes all the empty lines and the first line of each group of duplicates with sed,
removes everything but the line number with cut.
This is what the output looks like at the different stages:
$ nl -n ln file
1 i got red apple
2 i got red apple in my stomach
3 i got green apple
4 i got red apple
5 i like blue bananas
6 i got red apple
7 i like blues music
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2
3 i got green apple
1 i got red apple
4 i got red apple
6 i got red apple
2 i got red apple in my stomach
5 i like blue bananas
8 i like blue bananas
9 i like blue bananas
7 i like blues music
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend
1 i got red apple
4 i got red apple
6 i got red apple
5 i like blue bananas
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}'
4 i got red apple
6 i got red apple
8 i like blue bananas
9 i like blue bananas
$ nl -n ln file | sort -k 2 | uniq -f 1 --all-repeated=prepend | sed '/^$/{N;d}' | cut -f 1
4
6
8
9

Adding new line to file with sed

I want to add a new line to the top of a data file with sed, and write something to that line.
I tried this, as suggested in How to add a blank line before the first line in a text file with awk:
sed '1i\
\' ./filename.txt
but it printed a backslash at the beginning of the first line of the file instead of creating a new line. The terminal also throws an error if I try to put it all on the same line ("1i\": extra characters after \ at the end of i command).
Input:
1 2 3 4
1 2 3 4
1 2 3 4
Expected output:
14
1 2 3 4
1 2 3 4
1 2 3 4
$ sed '1i\14' file
14
1 2 3 4
1 2 3 4
1 2 3 4
but just use awk for clarity, simplicity, extensibility, robustness, portability, and every other desirable attribute of software:
$ awk 'NR==1{print "14"} {print}' file
14
1 2 3 4
1 2 3 4
1 2 3 4
Basically you are concatenating two files: a file containing one line, and the original file. As its name suggests, this is a task for cat:
cat - file <<< 'new line'
# or
echo 'new line' | cat - file
where - stands for stdin.
You can also use cat together with process substitution if your shell supports it:
cat <(echo 'new line') file
Btw, with sed it should be simply:
sed '1i\new line' file
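If the goal is to change filename.txt itself rather than print to stdout, a minimal sketch (assuming GNU sed for the -i option; the second variant avoids that dependency):
sed -i '1i\14' filename.txt
# or, via a temporary file:
{ echo 14; cat filename.txt; } > filename.tmp && mv filename.tmp filename.txt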

vimdiff: force line-by-line comparison (ignore supposedly missing/additional lines)

How do I force vimdiff to always compare two files line-by-line without identifying added or deleted lines?
The problem is that if the diff between two files is large, but by chance two lines in the file match up, vimdiff thinks these lines are the same and just treats the rest as added or deleted lines, and the resulting diff is totally unusable. In my case, line i in file1 always corresponds to line i in file2, so vimdiff has no business finding added or deleted lines.
Following is a small example with two files containing the values of two variables three times each. Vimdiff erroneously matches up file1/line1 with file2/line3 and thinks some lines around it have been added or deleted. The diff (minus colors) then looks like this:
| 1 foo 8.1047 < del/new
| 2 bar 6.2343 < del/new
1 foo 0.0000 | 3 foo 0.0000 < match
2 bar 5.3124 | 4 bar 1.4452 < wrong
3 foo 4.5621 | < new/del
4 bar 6.3914 | < new/del
5 foo 1.0000 | 5 foo 1.0000 < match
6 bar 6.3212 | 6 bar 7.2321 < wrong
What I want, however, is the following, with all lines marked as wrong except for the matching lines 5:
1 foo 0.0000 | 1 foo 8.1047 < wrong
2 bar 5.3124 | 2 bar 6.2343 < wrong
3 foo 4.5621 | 3 foo 0.0000 < wrong
4 bar 6.3914 | 4 bar 1.4452 < wrong
5 foo 1.0000 | 5 foo 1.0000 < match
6 bar 6.3212 | 6 bar 7.2321 < wrong
As I was copying this example to try it, I noticed that vimdiff will do what you want if you have the line number associated with each line.
Therefore, you can use cat to add the line number and then diff:
cat -n file1 > file1_with_line_no
cat -n file2 > file2_with_line_no
vimdiff file1_with_line_no file2_with_line_no
The output is then as you want (shown with diff for easy copying here):
diff file1_with_line_no file2_with_line_no --side-by-side
1 foo 0.0000 | 1 foo 8.1047
2 bar 5.3124 | 2 bar 6.2343
3 foo 4.5621 | 3 foo 0.0000
4 bar 6.3914 | 4 bar 1.4452
5 foo 1.0000 5 foo 1.0000
6 bar 6.3212 | 6 bar 7.2321
In bash you can add this to your .bashrc so that you can call linediff from the command line to diff two files using the approach above:
linediff() {
    if [ -z "$1" ] || [ -z "$2" ]; then return; fi
    f1=$(basename "$1")
    f2=$(basename "$2")
    cat -n "$1" > "/tmp/$f1"
    cat -n "$2" > "/tmp/$f2"
    vimdiff "/tmp/$f1" "/tmp/$f2"
    rm "/tmp/$f1" "/tmp/$f2"
}
and now linediff file1 file2 will do the above and clean up after.
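A possible refinement, not part of the original answer: writing to fixed names under /tmp collides if the two arguments happen to share a basename, so mktemp can be used instead:
linediff() {
    [ -n "$1" ] && [ -n "$2" ] || return
    local a b
    a=$(mktemp) && b=$(mktemp) || return
    cat -n "$1" > "$a"    # prefix each line with its line number
    cat -n "$2" > "$b"
    vimdiff "$a" "$b"
    rm -f "$a" "$b"       # clean up the temporary copies
}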
How about using the diffchar.vim plugin? It compares line by line in non-diff mode. Open the two files in two windows and just press F7. By default it finds the differences by character within each line, but you can change the difference unit to words or other granularities.
Vim relies on the external diff command to analyze the two files, so you can influence the result by using a different tool with a different algorithm. You can configure that via the 'diffexpr' option; the tool's output has to be in "ed" style. Cf. :help diff-diffexpr.
Note that this only affects which lines are marked as added / changed / deleted; for displaying the character differences within a changed line, Vim does that on its own.
Unfortunately, I don't know of an alternative diff tool that could provide such output, but maybe others can fill that in.

sed: How do I replace all lines containing a certain string?

Say I have a file containing these lines:
tomatoes
bananas
tomatoes with apples
pears
pears with apples
How do I replace every line containing the word "apples" with just "oranges"? This is what I want to end up with:
tomatoes
bananas
oranges
pears
oranges
Use sed 's/.*apples.*/oranges/' ?
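Run against the sample data (assuming it is saved as file), this produces exactly the requested output:
$ sed 's/.*apples.*/oranges/' file
tomatoes
bananas
oranges
pears
oranges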
You can use awk:
$ awk '/apples/{$0="oranges"}1' file
tomatoes
bananas
oranges
pears
oranges
This says: search for apples, and on a match change the whole record to "oranges"; the trailing 1 then prints every record.
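Another option, sketched here assuming GNU sed (which accepts the one-line form of the command), is the c command, which replaces the whole line for every match:
$ sed '/apples/c\oranges' file
tomatoes
bananas
oranges
pears
oranges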
