How to understand diff -u in linux?


example code
diff -r -u -P a.c b.c > diff.patch
I've tried searching the man page.
It says that diff -u "unifies" the output format. What does that mean, and when should we use it?
Thanks a lot.

From Wikipedia (diff utility):
The unified format (or unidiff) inherits the technical improvements made by the context format, but produces a smaller diff with old and new text presented immediately adjacent. Unified format is usually invoked using the "-u" command line option. This output is often used as input to the patch program. Many projects specifically request that "diffs" be submitted in the unified format, making unified diff format the most common format for exchange between software developers.
...
The format starts with the same two-line header as the context format, except that the original file is preceded by "---" and the new file is preceded by "+++". Following this are one or more change hunks that contain the line differences in the file. The unchanged, contextual lines are preceded by a space character, addition lines are preceded by a plus sign, and deletion lines are preceded by a minus sign.
A hunk begins with range information and is immediately followed with the line additions, line deletions, and any number of the contextual lines. The range information is surrounded by double-at signs, and combines onto a single line what appears on two lines in the context format (above). The format of the range information line is as follows:
@@ -l,s +l,s @@ optional section heading
...
The idea of any format that diff throws at you is to transform a source file into a destination file following a series of steps. Let's see a simple example of how this works with unified format.
Given the following files:
from.txt
a
b
to.txt
a
c
The output of diff -u from.txt to.txt is:
--- from.txt 2015-03-17 04:34:47.076997087 -0430
+++ to.txt 2015-03-17 04:35:27.872996388 -0430
@@ -1,2 +1,2 @@
 a
-b
+c
Explanation. Header description:
--- from.txt 2015-03-17 22:42:18.575039925 -0430 <-- from-file time stamp
+++ to.txt 2015-03-17 22:42:10.495040064 -0430 <-- to-file time stamp
This diff contains just one hunk (only one set of changes is needed to turn from.txt into to.txt):
@@ -1,2 +1,2 @@ <-- A hunk, a block describing changes between both files; there could be several of these in the diff -u output
   ^    ^
   |    (+) means that this change starts at line 1 and involves 2 lines in the to.txt file
   (-) means that this change starts at line 1 and involves 2 lines of the from.txt file
Next, the list of changes:
 a <-- This line is the same in both files, so it is left unchanged
-b <-- This line has to be removed from the "from.txt" file to transform it into the "to.txt" file
+c <-- This line has to be added to the "from.txt" file to transform it into the "to.txt" file
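The walkthrough above can be reproduced in a scratch directory. This is only a minimal sketch: the temporary directory and the change.patch file name are assumptions made for illustration, not part of the original example.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the two example files.
printf 'a\nb\n' > from.txt
printf 'a\nc\n' > to.txt

# diff exits with status 1 when the files differ, so "|| true" keeps the
# script going even under "set -e".
diff -u from.txt to.txt > change.patch || true

# Skip the two header lines (they carry timestamps) and show only the hunk.
hunk=$(tail -n +3 change.patch)
printf '%s\n' "$hunk"
```

The printed hunk is the @@ -1,2 +1,2 @@ range line followed by the context line " a", the removal "-b" and the addition "+c".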
Here are some StackOverflow answers with really nice info about this subject:
https://stackoverflow.com/a/10950496/1041822
https://stackoverflow.com/a/2530012/1041822
And some other useful documentation:
https://linuxacademy.com/blog/linux/introduction-using-diff-and-patch/
http://www.artima.com/weblogs/viewpost.jsp?thread=164293

The term "unified" was made up; "concise" would perhaps have been a better name.
The point of diff -u is that it is a more concise representation than context diff. Quoting from the original description of Wayne Davison's posting of unidiff to comp.sources.misc (volume 14, 31 Aug 90):
I've created a new context diff format that combines the old and new chunks into
one unified hunk. The result? The unified context diff, or "unidiff."
Posting your patch using a unidiff will usually cut its size down by around
25% (I've seen from 12% to 48%, depending on how many redundant context lines
are removed). Even if the diffs are generated with only 2 lines of context,
the savings still average around 20%.
Keep in mind that *no information is lost* by the conversion process. Only
the redundancy of having multiple identical context lines. [...]
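The size claim is easy to check on your own files. As a rough sketch, here is the comparison on the tiny from.txt/to.txt pair used earlier; the file names context.diff and unified.diff are made up for the illustration, and the exact saving depends on the input.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'a\nb\n' > from.txt
printf 'a\nc\n' > to.txt

# Produce the same change in both formats (diff exits 1 when files differ).
diff -c from.txt to.txt > context.diff || true
diff -u from.txt to.txt > unified.diff || true

# Count bytes; tr strips the padding some wc implementations add.
context_size=$(wc -c < context.diff | tr -d ' ')
unified_size=$(wc -c < unified.diff | tr -d ' ')
echo "context: $context_size bytes, unified: $unified_size bytes"
```

The unified output is smaller because the shared context line appears once instead of once per file.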
Here are some useful links:
How to read a patch or diff and understand its structure to apply it manually
What is the format of a patch file?
Not useful (and misleading)
2.2.2 Unified Format, which appears to omit attribution.

Related

Using diff command, ignore character at end of line

I'm not entirely sure what sort of diff command I'd do to match what I need. But basically I have two different directories full of files that I need to compare and outline the changes of. But in one set of files they basically have a '1' at the end of the line.
An example would be if comparing these two objects
File1/1.txt
I AM IDENTICAL
File2/1.txt
I AM IDENTICAL 1
So I'd just want the diff command to leave out the '1' at the end of the line and show me the files which actually have changes. So far I came up with something like
diff file1/ file2/ -rw -I "$1" | more
but that doesn't work.
Apologies if this is an easy obvious question.

If there are only a few files and they are small, you can eyeball the differences by using the vimdiff command to compare two files side by side:
vimdiff File1/1.txt File2/1.txt
Otherwise, as arkascha suggested, first you need to modify your files to eliminate the ending character(s) before comparing them.
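The "modify first, then compare" idea could look roughly like the sketch below. It assumes the unwanted suffix is exactly a space followed by 1 at the end of a line, as in the question's example, and it writes cleaned copies into a made-up "stripped" directory instead of touching the originals.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/File1" "$workdir/File2" "$workdir/stripped"
printf 'I AM IDENTICAL\n' > "$workdir/File1/1.txt"
printf 'I AM IDENTICAL 1\n' > "$workdir/File2/1.txt"

# Write cleaned copies of the File2 files with the trailing " 1" removed.
for f in "$workdir"/File2/*; do
    sed 's/ 1$//' "$f" > "$workdir/stripped/$(basename "$f")"
done

# -r recurses into directories, -q only names the files that differ.
if diff -rq "$workdir/File1" "$workdir/stripped"; then
    result="no real differences"
else
    result="files differ"
fi
echo "$result"
```

With the sample files from the question this prints "no real differences"; any file that still differs after cleaning is named by diff -rq.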

Using diff for two files and send by email [closed]

I have files like below. I use crontab every 5 min to check the files to see if the system's added one file, for example like this: AIR_2015xxxxT0yyyyyyyy.cfg. Then I need to use the diff command automatically between the last one and before the last one.
AIR_20151021T163514000.cfg
AIR_20151026T103845000.cfg
AIR_2015xxxxT0yyyyyyyy.cfg
I want to do this in a script like the one below:
#!/bin/bash
/var/opt/fds/
diff AIR_2015xxxxT0yyyyyyyy.cfg AIR_20151026T103845000.cfg > Test.txt
body(){
cat body.txt
}
(echo -e "$(body)") | -a Test.txt mailx -s 'Comparison' user@email.com

Given a list of files in the directory /var/opt/fds with names in the format:
AIR_YYYYmmddTHHMMSSfff.cfg
where the letter Y represents digits for the year, m for month, d for day, H for hour, M for minute, S for second, and f for fraction (milliseconds), then you need to establish the two most recent files in the directory to compare them.
One way to do this is:
cd /var/opt/fds || exit 1
old=
new=
for file in AIR_20[0-9][0-9]????T?????????.cfg
do
old=$new
new=$file
done
if [ -n "$old" ] && [ -n "$new" ]
then
diff "$old" "$new" > test.txt
mailx -a test.txt -s 'Comparison' user@example.com < body.txt
fi
Note that if the new file has a name containing letters x and y as shown in the question and comments, it will be listed after the names containing the time stamp as digits, so it will be picked up as the new file. It also assumes permission to write in the /var/opt/fds directory, and that the mail body file is present in that directory too. Those assumptions can be trivially fixed if necessary. The test.txt file should be deleted after it is sent, too, and you could check that it is non-empty before sending the email (just in case the two most recent files are in fact identical). You could embed a time-stamp in the generated file name containing the diffs instead of using test.txt:
output="diff.$(date +'%Y%m%dT%H%M%S000').txt"
and then use $output in place of test.txt.
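Pulling those suggestions together, a minimal sketch might look like this. It runs in a scratch directory with two made-up .cfg files standing in for the real ones, uses diff -u so the file names appear in the output, and only reports whether a mail would be sent instead of calling mailx.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in files; the contents are invented for the illustration.
printf 'x=1\n' > AIR_20151021T163514000.cfg
printf 'x=2\n' > AIR_20151026T103845000.cfg

# The glob expands in sorted order, so the last two names seen are the
# two most recent files.
old=
new=
for file in AIR_20[0-9][0-9]????T?????????.cfg
do
    old=$new
    new=$file
done

output="diff.$(date +'%Y%m%dT%H%M%S000').txt"
diff -u "$old" "$new" > "$output" || true   # diff exits 1 on differences

# -s is true only for a non-empty file, i.e. when there is something to report.
if [ -s "$output" ]; then
    decision="would send"
else
    decision="nothing to send"
fi
rm -f "$output"
echo "$old -> $new: $decision"
```

In a real script the "would send" branch is where the mailx call from above would go.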
The test ensures that there was both an old and a new name. The pattern match is sloppier than it could be, but using [0-9] or an appropriate subrange ([01], [0-3], [0-2], [0-5]) for the question marks makes the pattern unreadably long:
for file in AIR_20[0-9][0-9][01][0-9][0-3][0-9]T[0-2][0-9][0-5][0-9][0-5][0-9][0-9][0-9][0-9].cfg
It also probably provides very little extra in the way of protection. Of course, as shown, it imposes a Y2.1K crisis on the system, not that it is hard to fix that. You could also cut down the range of valid dates by basing it on today's date, but beware of the end of the year, etc. You might decide you only need entries from the last month or so.
Using globbing is generally better than trying to parse ls or find output. In this context, where the file names have a restricted set of characters in the name (no newlines, no blanks or tabs, no quotes, no dollar signs, etc), it is feasible to use either find or ls, but if you have to deal with arbitrary names created by random end users, those tools are not suitable. (The ls command does all sorts of weird stuff with weird names and basically is hard to use reliably in the face of user cussedness. The find command and its -print0 option can be used, especially if you have a sort that recognizes -z to work with null-terminated 'lines' and an xargs that supports -0 to handle such lines too, but you have to be very careful.)
Note that this scheme does not keep a record of the last file analyzed (so if no new files appear for an hour, you might send a dozen copies of the same differences), nor does it directly report on the file names (but using diff -u or diff -c would include the file names being diffed in the output). Again, these issues can be worked around if that's appropriate (and it probably is). Keeping the record of which files have been compared is probably the hardest job; even that's not too bad:
echo "$old" "$new" >> reported.diffs
to record what's been processed; then
if grep -q "$old $new" reported.diffs
then : Already processed
else : Process $old and $new
fi
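Here is one way the bookkeeping could be wrapped up, as a sketch only. The function names already_reported and process_pair are made up, and grep is given -x and -F so that the pair must match a whole line literally (the dots in the file names would otherwise act as regex wildcards).

```shell
workdir=$(mktemp -d)
log="$workdir/reported.diffs"
touch "$log"

# Succeeds if this exact pair has been recorded before.
already_reported() {
    grep -qxF "$1 $2" "$log"
}

# Records the pair the first time it is seen and says what to do with it.
process_pair() {
    if already_reported "$1" "$2"; then
        echo "skip"
    else
        echo "$1 $2" >> "$log"
        echo "process"
    fi
}

first=$(process_pair AIR_20151021T163514000.cfg AIR_20151026T103845000.cfg)
second=$(process_pair AIR_20151021T163514000.cfg AIR_20151026T103845000.cfg)
echo "$first $second"
```

The first call reports "process" and the second "skip", so a cron job built this way would not mail the same differences twice.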

linux cmp utility output: what is a "line"?

Could someone tell me what the "line" number represents in the output of a cmp command? I ask this because, first, I can't find it explained anywhere. Second, I am getting results comparing a set of files where the "char" outputs are identical (as expected) but the "line" outputs differ wildly.

The "line" outputs reflect the number of newline characters seen prior to that point in the file.
For a file which is not in a textual format, the "line" output is not likely to be meaningful, and can be ignored; for a file which is in textual format, the line number returned could be used in a text editor to navigate to the area with the difference.
Per the POSIX spec for cmp:
For files which are not text files, line numbers simply reflect the presence of a <newline>, without any implication that the file is organized into lines.
Because by default cmp prints only the first difference seen, the line numbers between both files are guaranteed to be identical at that point. When passed -l, cmp continues beyond the first difference -- but no longer prints line numbers, thus avoiding any ambiguity as to which file's line number count is canonical.
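A small experiment makes the counting visible. The file names and contents below are invented for the illustration; the files differ only in one letter on the third line.

```shell
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$workdir/a.txt"
printf 'one\ntwo\nthrEe\n' > "$workdir/b.txt"

# cmp exits with status 1 when the files differ, hence "|| true".
report=$(cmp "$workdir/a.txt" "$workdir/b.txt" || true)
echo "$report"
```

The report ends in "line 3" because two newline characters precede the first differing byte, which is byte 12 of each file.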

how did git handle one of several patches?

Let's say we have 5 patches for a file, and 4 of the patches changed the file content and added new lines. But we can still apply patch 5 on its own to the git tree. Why? I thought the line numbers had changed, so the line contents would no longer match. How does git determine which line I changed? Through the three lines of context around the change? That doesn't seem plausible to me.
BTW, how can I generate a number-zero patch like [PATCH 0/5]? It seems format-patch can only start numbering from 0001.

Through the three line context of the change?
This is how git apply describes it:
Ensure at least <n> lines of surrounding context match before and after each change.
When fewer lines of surrounding context exist they all must match.
So yes, even if the line numbers changed, the context is still key in determining if a patch should apply or not.
Regarding the numbering aspect, I didn't test, but see if the --start-number option of git format-patch command can help:
--start-number <n>
Start numbering the patches at <n> instead of 1.

How to edit 300 GB text file (genomics data)?

I have a 300 GB text file that contains genomics data with over 250k records. There are some records with bad data, and our genomics program 'PoPoolation' allows us to comment out the "bad" records with an asterisk. Our problem is that we cannot find a text editor that will load the data so that we can comment out the bad records. Any suggestions? We have both Windows and Linux boxes.
UPDATE: More information
The program PoPoolation (https://code.google.com/p/popoolation/) crashes when it reaches a "bad" record, giving us the line number, which we can then comment out. Specifically, we get a message from Perl that says "F#€%& Scaffolding". The manual suggests we can just use an asterisk to comment out the bad line. Sadly, we will have to repeat this process many times...
One more thought... Is there an approach that would allow us to add the asterisk to the line without opening the entire text file at once. This could be very useful given that we will have to repeat the process an unknown number of times.

Based on your update:
One more thought... Is there an approach that would allow us to add
the asterisk to the line without opening the entire text file at once.
This could be very useful given that we will have to repeat the
process an unknown number of times.
Here is one approach: if you know the line number, you can add an asterisk at the beginning of that line with:
sed 'LINE_NUMBER s/^/*/' file
See an example:
$ cat file
aa
bb
cc
dd
ee
$ sed '3 s/^/*/' file
aa
bb
*cc
dd
ee
If you add -i, the file will be updated:
$ sed -i '3 s/^/*/' file
$ cat file
aa
bb
*cc
dd
ee
That said, I always think it's better to redirect the output to another file
sed '3 s/^/*/' file > new_file
so that your original file stays intact and the updated version is saved in new_file.
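Since the process has to be repeated for an unknown number of bad lines, it may be worth collecting the line numbers and marking them all in one pass over the file. A sketch, assuming the numbers are kept in a space-separated variable (bad_lines and mark.sed are made-up names):

```shell
workdir=$(mktemp -d)
printf 'aa\nbb\ncc\ndd\nee\n' > "$workdir/file"

# Hypothetical list of line numbers reported so far.
bad_lines="2 4"

# Build a sed script with one "Ns/^/*/" command per bad line.
for n in $bad_lines; do
    printf '%ss/^/*/\n' "$n"
done > "$workdir/mark.sed"

# A single pass over the input applies every command.
sed -f "$workdir/mark.sed" "$workdir/file" > "$workdir/new_file"
result=$(cat "$workdir/new_file")
printf '%s\n' "$result"
```

Here lines 2 and 4 come out as *bb and *dd while the other lines are unchanged.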

If you are required to have a person mark these records manually with a text editor, for whatever reason, you should probably use split to split the file up into manageable pieces.
split -a4 -d -l100000 hugefile.txt part.
This will split the file up into pieces with 100000 lines each. The names of the files will be part.0000, part.0001, etc. Then, after all the files have been edited, you can combine them back together with cat:
cat part.* > new_hugefile.txt
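A quick way to convince yourself that the round trip is lossless is to try it on a small file first. This sketch assumes a split that supports -d for numeric suffixes (GNU coreutils does) and uses a 25-line stand-in for the huge file.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Small stand-in for the huge file: 25 numbered lines.
seq 1 25 > hugefile.txt

# 10 lines per piece gives part.0000, part.0001 and part.0002.
split -a4 -d -l10 hugefile.txt part.

# Count the pieces via the glob instead of parsing ls output.
set -- part.*
pieces=$#

# The glob sorts the pieces by name, so cat restores the original order.
cat part.* > new_hugefile.txt

if cmp -s hugefile.txt new_hugefile.txt; then
    roundtrip=identical
else
    roundtrip=different
fi
echo "$pieces pieces, round trip $roundtrip"
```

The fixed-width numeric suffixes are what keep the pieces in the right order when they are concatenated.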

The simplest solution is to use a stream-oriented editor such as sed. All you need is to be able to write one or more regular expression(s) that will identify all (and only) the bad records. Since you haven't provided any details on how to identify the bad records, this is the only possible answer.
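As an illustration only, suppose bad records could be recognised by the text BAD at the end of the line. That pattern is purely hypothetical, since the real criterion is not given in the question.

```shell
workdir=$(mktemp -d)
printf 'rec1 ok\nrec2 BAD\nrec3 ok\n' > "$workdir/records.txt"

# Prefix an asterisk only on lines matching the (hypothetical) pattern.
marked=$(sed '/BAD$/ s/^/*/' "$workdir/records.txt")
printf '%s\n' "$marked"
```

Only the second record gets the leading asterisk; the regular expression in the address is the part that would need to be replaced with whatever really identifies a bad record.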

A basic pattern in R is to read the data in chunks, edit, and write out
fin = file("fin.txt", "r")
fout = file("fout.txt", "w")
while (length(txt <- readLines(fin, n=1000000))) {
## txt is now up to 1000000 lines, add an asterisk to problem lines
## bad = <create logical vector indicating bad lines here>
## txt[bad] = paste0("*", txt[bad])
writeLines(txt, fout)
}
close(fin); close(fout)
While not ideal, this works on Windows (implied by the mention of Notepad++) and in a language that you are presumably familiar with (R). Using sed (definitely the appropriate tool in the long run) would require installing additional software and coming up to speed with sed.
