Using 'diff' (or anything else) to get character-level diff between text files


I'd like to use 'diff' to get both a line-level and a character-level difference.
For example, consider:
File 1
abcde
abc
abcccd
File 2
abcde
ab
abccc
Using diff -u I get:
@@ -1,3 +1,3 @@
abcde
-abc
-abcccd
\ No newline at end of file
+ab
+abccc
\ No newline at end of file
However, it only shows me that there were changes in these lines. What I'd like to see is something like:
@@ -1,3 +1,3 @@
abcde
-ab<ins>c</ins>
-abccc<ins>d</ins>
\ No newline at end of file
+ab
+abccc
\ No newline at end of file
You get my drift.
Now, I know I can use other engines to mark/check the difference on a specific line. But I'd rather use one tool that does all of it.

Git has a word diff, and defining all characters as words effectively gives you a character diff. However, newline changes are ignored.
Example
Create a repository like this:
mkdir chardifftest
cd chardifftest
git init
echo -e 'foobarbaz\ncatdog\nfox' > file
git add -A; git commit -m 1
echo -e 'fuobArbas\ncat\ndogfox' > file
git add -A; git commit -m 2
Now run git diff --word-diff=color --word-diff-regex=. master^ master.
In the colored output, both additions and deletions are recognized at the character level, while additions and deletions of newlines are ignored.
You may also want to try one of these:
git diff --word-diff=plain --word-diff-regex=. master^ master
git diff --word-diff=porcelain --word-diff-regex=. master^ master
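To see why newline changes drop out, here is a minimal Python sketch of the idea (not git's actual implementation): a regex of . turns every character into a one-character word, and . does not match a newline, so line breaks never become tokens. The names char_tokens and char_word_diff are made up for this illustration, and the [-x-]{+y+} markers only imitate the --word-diff=plain style.

```python
import difflib
import re


def char_tokens(text):
    # "." matches any character except a newline, so newlines are dropped.
    return re.findall(r".", text)


def char_word_diff(old, new):
    # Diff the two token lists and mark removed and added runs.
    a, b = char_tokens(old), char_tokens(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append("".join(a[i1:i2]))
        if tag in ("delete", "replace"):
            out.append("[-" + "".join(a[i1:i2]) + "-]")
        if tag in ("insert", "replace"):
            out.append("{+" + "".join(b[j1:j2]) + "+}")
    return "".join(out)


if __name__ == "__main__":
    # The two file versions from the example repository above.
    print(char_word_diff("foobarbaz\ncatdog\nfox", "fuobArbas\ncat\ndogfox"))
```

Because "catdog\nfox" and "cat\ndogfox" produce the same token list, the moved newline does not show up as a change at all.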

You can use:
diff -u f1 f2 |colordiff |diff-highlight
colordiff is an Ubuntu package. You can install it using sudo apt-get install colordiff.
diff-highlight is from git (since version 2.9). It is located in /usr/share/doc/git/contrib/diff-highlight/diff-highlight. You can put it somewhere in your $PATH.

Python's difflib is ace if you want to do this programmatically. For interactive use, I use vim's diff mode (easy enough to use: just invoke it with vimdiff a b). I also occasionally use Beyond Compare, which does pretty much everything you could hope for from a diff tool.
I haven't seen any command line tool which does this usefully, but as Will notes, the difflib example code might help.

You can use the cmp command in Solaris:
cmp
Compare two files, and if they differ, tells the first byte and line number where they differ.
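If cmp is not at hand, the same kind of report can be sketched in Python. This is only an approximation of what cmp prints, and first_difference is a made-up name; when one input is a prefix of the other, the sketch simply reports the first missing byte.

```python
def first_difference(a, b):
    """Return (byte_number, line_number), both 1-based, or None if equal."""
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            # Line number = newlines seen before the differing byte, plus one.
            return index + 1, a.count(b"\n", 0, index) + 1
    if len(a) != len(b):
        # One input is a prefix of the other: report the first missing byte.
        shorter = min(len(a), len(b))
        return shorter + 1, a.count(b"\n", 0, shorter) + 1
    return None
```

To use it on files, read both in binary mode, e.g. first_difference(open(p1, "rb").read(), open(p2, "rb").read()).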

Python has convenient library named difflib which might help answer your question.
Below are two oneliners using difflib for different python versions.
python3 -c 'import difflib, sys; \
print("".join( \
difflib.ndiff( \
open(sys.argv[1]).readlines(),open(sys.argv[2]).readlines())))' old_file new_file
python2 -c 'import difflib, sys; \
print "".join( \
difflib.ndiff( \
open(sys.argv[1]).readlines(), open(sys.argv[2]).readlines()))' old_file new_file
These might come in handy as a shell alias which is easier to move around with your .${SHELL_NAME}rc.
$ alias char_diff="python2 -c 'import difflib, sys; print \"\".join(difflib.ndiff(open(sys.argv[1]).readlines(), open(sys.argv[2]).readlines()))'"
$ char_diff old_file new_file
And a more readable version to put in a standalone file:
#!/usr/bin/env python2
from __future__ import with_statement
import difflib
import sys
with open(sys.argv[1]) as old_f, open(sys.argv[2]) as new_f:
    old_lines, new_lines = old_f.readlines(), new_f.readlines()
diff = difflib.ndiff(old_lines, new_lines)
print ''.join(diff)

Coloured, character-level diff output
Here's what you can do with the script below and diff-highlight (which is part of git):
#!/bin/sh -eu
# Use diff-highlight to show word-level differences
diff -U3 --minimal "$@" |
sed 's/^-/\x1b[1;31m-/;s/^+/\x1b[1;32m+/;s/^@/\x1b[1;34m@/;s/$/\x1b[0m/' |
diff-highlight
(Credit to @retracile's answer for the sed highlighting)

cmp -l file1 file2 | wc
This worked well for me. The leftmost number in the output is the number of lines cmp -l printed, i.e. the number of bytes that differ.
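The same count can be sketched in a few lines of Python. count_differing_bytes is a made-up name, and like the position-by-position comparison it imitates, it only looks at the part both inputs have in common.

```python
def count_differing_bytes(a, b):
    # Compare position by position, up to the length of the shorter input.
    return sum(1 for x, y in zip(a, b) if x != y)
```

Keep in mind that this is a positional comparison: a single inserted character shifts everything after it, so the count can be far larger than the number of edits a diff tool would report.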

I also wrote my own script to solve this problem using the Longest common subsequence algorithm.
It is executed as such
JLDiff.py a.txt b.txt out.html
The result is in HTML with red and green coloring. Processing time grows steeply with file size, but this does a true character-by-character comparison without checking line by line first.
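For readers who have not seen the technique, here is a minimal sketch of a character diff built on the textbook longest-common-subsequence table. It is not the code of JLDiff.py; char_diff and its "=", "-", "+" tags are assumptions made for this example.

```python
def char_diff(a, b):
    """Return a list of (tag, char) with tag '=' (kept), '-' (only in a), '+' (only in b)."""
    n, m = len(a), len(b)
    # table[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start, emitting one operation per character.
    ops = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            ops.append(("=", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", a[i]))
            i += 1
        else:
            ops.append(("+", b[j]))
            j += 1
    ops.extend(("-", ch) for ch in a[i:])
    ops.extend(("+", ch) for ch in b[j:])
    return ops
```

The table has len(a) x len(b) entries, which is why this approach gets slow and memory-hungry on large files.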

Python's difflib can do this.
The documentation includes an example command-line program for you.
The exact format is not as you specified, but it would be straightforward to either parse the ndiff-style output or to modify the example program to generate your notation.
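As a rough illustration of generating the notation from the question, the sketch below uses difflib.SequenceMatcher on one pair of lines and wraps the characters that exist only in the old line in <ins> tags. mark_removed is a made-up helper, and pairing up the old and new lines is left to the caller.

```python
import difflib


def mark_removed(old, new):
    # Wrap the parts of `old` that have no counterpart in `new`.
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    parts = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        segment = old[i1:i2]  # empty for pure insertions
        if tag in ("delete", "replace"):
            parts.append("<ins>" + segment + "</ins>")
        else:
            parts.append(segment)
    return "".join(parts)


if __name__ == "__main__":
    for old, new in [("abc", "ab"), ("abcccd", "abccc")]:
        print("-" + mark_removed(old, new))
        print("+" + new)
```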

As one comment on the main answer said, you don't have to commit anything to use git diff:
git diff --word-diff=color --word-diff-regex=. file1 file2
Green marks characters that appear only in the second file (additions).
Red marks characters that appear only in the first file (deletions).

Here is an online text comparison tool:
http://text-compare.com/
It can highlight every single character that is different and continues comparing the rest.

ccdiff is a convenient dedicated tool for the task.
By default, it highlights the differences in color, but it can be used in a console without color support too.
The package is included in the main repository of Debian, with this description:
ccdiff is a colored diff that also colors inside changed lines.
All command-line tools that show the difference between two files fall short in showing minor changes in a visually useful way. ccdiff tries to give the look and feel of diff --color or colordiff, but extends the display of colored output from colored deleted and added lines to colors for deleted and added characters within the changed lines.

Not a complete answer, but if cmp -l's output is not clear enough, you can use:
sed 's/\(.\)/\1\n/g' file1 > file1.vertical
sed 's/\(.\)/\1\n/g' file2 > file2.vertical
diff file1.vertical file2.vertical

If you keep your files in Git, you can diff between versions with the diff-highlight script, which will show different lines, with differences highlighted.
Unfortunately it only works when the number of lines removed matches the number of lines added - there is stub code for when lines don't match, so presumably this could be fixed in the future.
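To make the pairing idea concrete, here is a small sketch that assumes a common-prefix/common-suffix comparison of one removed line with one added line, which is my understanding of how diff-highlight picks the part to emphasize. split_change is a made-up name, and this is not the script's actual code.

```python
def split_change(old, new):
    """Return (prefix, old_middle, new_middle, suffix) for one line pair."""
    limit = min(len(old), len(new))
    # Length of the shared prefix.
    p = 0
    while p < limit and old[p] == new[p]:
        p += 1
    # Length of the shared suffix, not allowed to overlap the prefix.
    s = 0
    while s < limit - p and old[-1 - s] == new[-1 - s]:
        s += 1
    return old[:p], old[p:len(old) - s], new[p:len(new) - s], old[len(old) - s:]
```

Only the two middle parts would be highlighted. The approach needs a one-to-one pairing of removed and added lines, which is why hunks with unequal counts are harder to handle.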

I think the simpler solution is always a good solution.
In my case, the code below helps me a lot. I hope it helps anybody else.
#!/usr/bin/env python3
import sys
def readfile(fileName):
    # Read the whole file into one string.
    with open(fileName) as f:
        return f.read()
def diff(s1, s2):
    # Return the index of the first differing character, or -1 if identical.
    counter = 0
    for ch1, ch2 in zip(s1, s2):
        if ch1 != ch2:
            break
        counter += 1
    if counter == len(s1) == len(s2):
        return -1
    return counter
f1 = readfile(sys.argv[1])
f2 = readfile(sys.argv[2])
pos = diff(f1, f2)
end = pos + 200
if pos >= 0:
    print("Different at:", pos)
    print(">", f1[pos:end])
    print("<", f2[pos:end])
You can compare two files with the following syntax at your favorite terminal:
$ ./diff.py fileNumber1 fileNumber2

Most of these answers mention using diff-highlight, a Perl module. But I didn't want to figure out how to install a Perl module, so I made a few minor changes to turn it into a self-contained Perl script.
You can install it using:
▶ curl -o /usr/local/bin/DiffHighlight.pl \
https://raw.githubusercontent.com/alexharv074/scripts/master/DiffHighlight.pl
And the usage (if you have the Ubuntu colordiff mentioned in zhanxw's answer):
▶ diff -u f1 f2 | colordiff | DiffHighlight.pl
And the usage (if you don't):
▶ diff -u f1 f2 | DiffHighlight.pl

Related

How do I extract specific strings from multiple files and write them to .txt in bash?

I have a lot of files in the folder filesToCheck, some examples given below. I need the output result.txt as also shown below. I can use linux bash with any commands that do not require extra installations.
The try-out below (with help from stackoverflow) has two problems when I execute it. It only looks for one instance of CAKE_FROSTING despite the global flag g and the file result.txt remains empty despite > result.txt.
sed -Enz 's/.*CAKE_FROSTING\(\n?"([^"]*).*/\1\n/gp' filesToCheck/* > result.txt
What do I need to change?
file1.cpp
something
CAKE_FROSTING("is.simply.the.best", "[no][matter][what]") { DO(something(0) == 1); }
file2.h
something else
CAKE_FROSTING(
"is.kinda.neat",
"[i][agree]") something else
something more
file3.cpp
random_text CAKE_FROSTING("Can be nice") "more random text"
CAKE_CREAM("totally.sucks", "[trust][me]")
random_text CAKE_FROSTING("Can be very nice") "indeed"
desiredResult.txt
is.simply.the.best
is.kinda.neat
Can be nice
Can be very nice
currentResult command line output:
is.simply.the.best
is.kinda.neat
Can be nice

Assuming the string CAKE_FROSTING occurs once per line, you can try this sed
$ sed -En ':a;N;s/.*CAKE_FROSTING\(\n?"([^"]*).*/\1/p;ba' filesToCheck/*
is.simply.the.best
is.kinda.neat
Can be nice
Can be very nice
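If Python 3 happens to be available, a regex with findall handles several matches per line as well as the multi-line case. This is a sketch rather than a drop-in replacement for the sed command; extract_frosting and write_results are made-up names.

```python
import re
from pathlib import Path

# First quoted argument of CAKE_FROSTING(...); \s* also skips a line break.
PATTERN = re.compile(r'CAKE_FROSTING\(\s*"([^"]*)"')


def extract_frosting(text):
    # findall returns every captured group, so repeated matches are kept.
    return PATTERN.findall(text)


def write_results(folder, out_path):
    # Collect matches from every file in the folder, one per output line.
    with open(out_path, "w") as out:
        for path in sorted(Path(folder).iterdir()):
            if path.is_file():
                for match in extract_frosting(path.read_text()):
                    out.write(match + "\n")
```

Calling write_results("filesToCheck", "result.txt") would then produce the list from the question.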

Handling expanded git commands with python subprocess module

I'm trying to retrieve and work with data from historical versions of files in a git repo. I'd like to have something like a dictionary that holds <hash>, <time of commit>, <value retrieved from contents of a file revision>, <commit message> for each entry.
I figured the data I retrieve from each file revision, and any calculations done with them, would be best handled using python. And the subprocess module appeared to be the best fit to integrate my git commands.
Below I show how I'm defining a function getval(key, filename) that I had hoped would output <SHA-1 hash>:<Value> to console, but would like to have a dict with more info... also with <time>, and <commit message>.
I help operate an ion accelerator, where we store 'savesets'--or values relevant to a given accelerator tune--using git. Of the values in these files, are things like charge(Q) and mass(A). Ultimately, I want to retrieve both values, get the ratio (Q/A), and display a list of file revision hashes sorted by the charge:mass ratio of the ion we delivered with the settings in that file's revision.
Sample of file (for 56Fe17+):
# Date: 2018-12-21 01:49:16.888
PV,SELECTED,TIMESTAMP,STATUS,SEVERITY,VALUE_TYPE,VALUE,READBACK,READBACK_VALUE,DELTA,READ_ONLY
REA_EXP:LINE,0,1544047322.881066957,NO_ALARM,NONE,enum,"JENSA~[UDF;AT-TPC;GPL;JENSA]",,"---",,true
REA_BTS19:BEAM:OPTICSFILE,0,1541798820.065952460,NO_ALARM,NONE,string,"BTS19_test3.data",,"---",,true
REA_BTS19:BEAM:A_BOOK,0,1545322510.562031883,NO_ALARM,NONE,double,"56.0",,"---",,true
REA_BTS19:BEAM:Z_BOOK,0,1545322567.544226340,NO_ALARM,NONE,double,"26.0",,"---",,true
REA_BTS19:BEAM:Q_BOOK,0,1545322512.701768974,NO_ALARM,NONE,double,"17.0",,"---",,true
So far--and with the help of others here--I've figured out a git one-liner that greps the revision history of a given file for a key[a string] and uses sed and awk to output <hash>:<val associated with the key>.
Git Oneliner I'm Starting with:
git grep 'BTS19:BEAM:A_BOOK' $(git rev-list --all) -- ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp | sed 's/:/,/' | awk -F, '{print $1 ":" $8}'
Oneliner's Output
e78f73fe6f90e93d5b3ccf90975b0e540d12ce09:"56.0"
4b94745bd0a6594bb42a774c95b5fc0847ef2d82:"56.0"
f2d5e263deac1d9112be791b39f4ce1b1b34e55d:"56.0"
c03800de52143ddb2abfab51fcc665ff5470e363:"56.0"
4a3a564a6d87bc6ff5f3dc7fec7670aeecfe6a79:"58.0"
d591941e51c4eab1237ce726a2a49448114b8f26:"58.0"
a9c8f5cdf224ff4fd94514c33888796760afd792:"58.0"
2f221492beea1663216dcfb27da89343817b11fd:"58.0"
I've also started playing with the subprocess python module. But I'm struggling to figure out how to handle my more complicated git commands. Generally, I'll want to be able to pass a key, and a file.. something like getval(key, filename).
When my cmd string was ['git', 'grep', str, '$(git rev-list --all)', '--', pathspec], it returned errors stating that '$(git rev-list --all)' was ambiguous. Thinking it wasn't being expanded, I added a separate process to execute the nested command, but I'm not sure I'm doing this correctly.
My Python file (gitfun.py): which I'm currently running the function from
import sys, os
import subprocess
def getval(str, pathspec, repoDir='/mnt/d/stash.projects/rea'):
    p1 = subprocess.Popen(["git", "rev-list", "--all"], stdout=subprocess.PIPE)
    output, err = p1.communicate()
    cmd = ['git', 'grep', str, output, '--', pathspec]
    p2 = subprocess.Popen(cmd, cwd=repoDir)
    p2.wait()
cwd = '/mnt/d/stash.projects/rea'
filename = 'ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp'
os.chdir(cwd)
getval('BTS19:BEAM:A_BOOK', filename)
Currently it is returning 'file name too long' so (even though I'm not convinced it really is too long) I tried changing my core.longpaths in git config to true, however this had no effect. Again why I suspect I'm not handling my replacement of the $(git rev-list --all) expansion correctly.
For this code, I expect something that looks like this:
522628b8d3db01ac330240b28935933b0448649c:ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp:REA_BTS19:BEAM:A_BOOK,0,1545240215.743201855,NO_ALARM,NONE,double,"58.0",,"---",,true
2557c599d2dc67d80ffc5b9be3f79899e0c15a10:ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp:REA_BTS19:BEAM:A_BOOK,0,1545240215.743201855,NO_ALARM,NONE,double,"58.0",,"---",,true
7fc97ec2aa76f32265196c42dbcd289c49f0ad93:ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp:REA_BTS19:BEAM:A_BOOK,0,1545240215.743201855,NO_ALARM,NONE,double,"58.0",,"---",,true
...
But I ultimately want an output to console that looks identical to the git one-liner above, or better yet, a dict that I can print to console or do other things with.

Remember that your shell tokenizes the command line using white space.
When you run git rev-list --all, you get output like:
2a4be2748fad885f88163a5b9b1b438fe3cb2ece
c1a30c743eb810fbefe1dc314277931fa33842b3
b2e5c75131e94a3543e5dcf9fb641ccd553906b4
95718f7e128a8b36ca93d6589328cc5b739668b1
87a9ada188a8cd1c13e48c21f093be7027d61eca
When you substitute that into your git grep command...
git grep 'BTS19:BEAM:A_BOOK' $(git rev-list --all) -- \
ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp
...each line is a separate argument. That is, if the output of git rev-list --all was exactly what I've shown above, then your one-liner would be tokenized into the following arguments, which I have listed one per line for clarity:
git
grep
BTS19:BEAM:A_BOOK
2a4be2748fad885f88163a5b9b1b438fe3cb2ece
c1a30c743eb810fbefe1dc314277931fa33842b3
b2e5c75131e94a3543e5dcf9fb641ccd553906b4
95718f7e128a8b36ca93d6589328cc5b739668b1
87a9ada188a8cd1c13e48c21f093be7027d61eca
--
ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp
But you're not doing this in your Python code! You're passing the entire output of git rev-list --all as a single argument. That means the command you're trying to execute has a fixed number (6) of arguments:
git
grep
BTS19:BEAM:A_BOOK
2a4be2748fad885f88163a5b9b1b438fe3cb2ece c1a30c743eb810fbefe1dc314277931fa33842b3 b2e5c75131e94a3543e5dcf9fb641ccd553906b4 95718f7e128a8b36ca93d6589328cc5b739668b1 87a9ada188a8cd1c13e48c21f093be7027d61eca
--
ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp
All those revisions are getting bundled together in a single argument, which is where the "filename too long" error comes from. You need to split that output into multiple arguments just like the shell does:
p1 = subprocess.Popen(["git", "rev-list", "--all"], stdout=subprocess.PIPE)
output, err = p1.communicate()
cmd = ['git', 'grep', str] + output.splitlines() + ['--', pathspec]
p2 = subprocess.Popen(cmd, cwd=repoDir)
p2.wait()
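Going one step further towards the dict the question asks for, here is a hedged Python 3 sketch. It assumes the git grep output format shown in the question (hash, path, then the CSV record, with no colon in the path) and that VALUE is the seventh CSV column, as in the sample header. build_grep_cmd and parse_grep_line are made-up helpers, and getval here takes the repository directory as an explicit argument.

```python
import csv
import subprocess


def build_grep_cmd(key, rev_list_output, pathspec):
    # split() with no argument splits on any whitespace, like the shell does.
    return ["git", "grep", key] + rev_list_output.split() + ["--", pathspec]


def parse_grep_line(line):
    # A line looks like <hash>:<path>:<csv record>; the record may contain ':'.
    rev, _path, record = line.split(":", 2)
    fields = next(csv.reader([record]))
    return rev, fields[6]  # VALUE column, returned as a string


def getval(key, pathspec, repo_dir):
    # Capture both commands' output as text instead of letting it hit the console.
    revs = subprocess.run(["git", "rev-list", "--all"], cwd=repo_dir,
                          capture_output=True, text=True, check=True).stdout
    grep = subprocess.run(build_grep_cmd(key, revs, pathspec), cwd=repo_dir,
                          capture_output=True, text=True)
    return dict(parse_grep_line(line) for line in grep.stdout.splitlines())
```

The returned dict maps each commit hash to the raw value string, so float() can be applied where a number is needed. Commit time and message are not collected in this sketch.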

change multiple files commandline

I have separated some tracks from mp3 mixes using mp3splt.
(For anyone interested: mp3splt -c cuefile.cue nonstopmix.mp3 splits a mix using a .cue file; it is in the Ubuntu repos.)
And I ended up with these filenames: "Antares" - 01 - "Xibalba".mp3, which is not a format I prefer. I've made it a little project to change them with a shell script, but it's more difficult than I anticipated.
I want to change the filename from:
"Antares" - 01 - "Xibalba".mp
to:
01-Antares_-_Xibalba.mp3
so far I've used :
for var in *.mp3; do mv "$var" "${var/\"/}"; done
and I could repeat that until I'm through, delete the 0x number and add one, but I'd like to do it more efficiently.
Could anyone give me a pointer (!not a script!)?
I'd still like to write it myself, but there are so many options that I'm a bit lost.
so far I thought to use this program flow:
read all the filenames containing .mp3 and declare as variable $var
strip $var from quotes
select 0x number, append delimiter _ (0x_)
move 0x_ to the beginning of the string
select remaining ' - - ' and change to '-'
done
which bash programs to use? especially changing the 0x puzzles me cuz I need a loop which increments this number and test if it is present in the filename variable and then it has to be changed.

It is easy to do in python 2.x. You can use this logic in any language you want.
import string
a=raw_input('Enter the name of song')
a=a.replace('"', "")
a=a.replace('.mp', ' .mp3')
words = a.split()
print words[2]+'-'+words[0]+'_-_'+words[4]+words[5]
Logic:
I removed the ", then changed .mp to .mp3, then split the string, which creates a list (array), and then printed the elements in the order needed.
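The same logic can be sketched with a regular expression, which also avoids breaking on names that do not follow the pattern. This is an illustration for Python 3; new_name is a made-up helper that accepts both the .mp and the .mp3 spelling of the extension and leaves the actual renaming to the caller.

```python
import re

# "Artist" - NN - "Title".mp3  (the trailing 3 is optional)
NAME_RE = re.compile(
    r'^"(?P<artist>[^"]+)" - (?P<track>\d+) - "(?P<title>[^"]+)"\.mp3?$'
)


def new_name(old_name):
    match = NAME_RE.match(old_name)
    if match is None:
        return None  # not in the expected format; leave the file alone
    return "{track}-{artist}_-_{title}.mp3".format(**match.groupdict())
```

A loop over os.listdir() that calls os.rename(name, new_name(name)) whenever new_name returns a value would then do the actual renaming.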

Try doing this :
rename -n 's/"(\w+)"\s+-\s*(\d+)\s*-\s*"(\w+)"\.mp/$2-$1_-_$3.mp3/' *mp
from the shell prompt. It's very useful: you can use Perl tricks, as I did in the substitution.
You can remove the -n (dry-run mode switch) once your tests look right.
There are other tools with the same name which may or may not be able to do this, so be careful.
If you run the following command (linux)
$ file $(readlink -f $(type -p rename))
and you have a result like
.../rename: Perl script, ASCII text executable
then this seems to be the right tool =)
If not, to make it the default (usually already the case) on Debian and derivatives like Ubuntu:
$ sudo update-alternatives --set rename /path/to/rename
Last but not least, this tool was originally written by Larry Wall, the creator of Perl.

Manually merge two files using diff

I'd like to merge two files by doing the following:
Output the diff of the two files into a temp file and
Manually select the lines I want to copy/save.
The problem here is that diff -u only gives me a few lines of context, while I want to output the entire file in a unified format.
Is there any way diff can do this?

One option that might fit the bill for you,
sdiff : side-by-side diff of files.
sdiff -o merged.file left.file right.file
Once there, it will prompt you with what lines you want to keep from which file. Hit ? and then enter for a little help. Also man sdiff with the detailed goods.
(In my distro, these come packaged in the "diffutils" package [fedora,centos])
If you need to automate the process, you might want to try the util merge, which will mark conflicts in the files. However, that might put you back at square one.

"I want to output the entire file in a unified format. Is there any way diff can do this?"
Yes.
diff -U 9999999 file1.txt file2.txt > diff.txt
This should work, provided your files are less than 10 million lines long.
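The same trick works with Python's difflib if you would rather not guess a large enough number: set the context size to the length of the longer file. full_context_diff and the default file names are assumptions for this sketch.

```python
import difflib


def full_context_diff(old_lines, new_lines,
                      old_name="file1.txt", new_name="file2.txt"):
    # Context at least as long as the longer file keeps every line in one hunk.
    context = max(len(old_lines), len(new_lines))
    return list(difflib.unified_diff(old_lines, new_lines,
                                     old_name, new_name, n=context))
```

Pass it the results of readlines() on both files. Note that unified_diff yields nothing at all when the two inputs are identical.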

You can merge/combine the two files with diff using:
diff --line-format %L file1 file2

The easy answer is to use the -D flag to merge the files and surround the differences with C style #ifdef statements.
From the documentation:
-D NAME --ifdef=NAME
Output merged file to show `#ifdef NAME' diffs.
You can use it as follows:
$ diff -D NEWSTUFF file1 file2 > merged_file
I usually then just open the merged file in an editor and resolve the merge conflicts by hand.
You also can use options to output an ed script, etc.

If you are an emacs user, you can do this directly in emacs using the "emerge" tool:
https://www.gnu.org/software/emacs/manual/html_node/emacs/Emerge.html
Issuing M-x emerge-files will open an interactive prompt with a view of files A, B, and the merged file to allow choosing text that differs between files A & B, inserting part of A into B, and more.

compare two files and save the difference in linux

I'd like to compare two text files and save the difference under Linux.
I know there are tools like kdiff, diff, vimdiff, etc., but my requirements are as follows:
Output should be in a separate file
The differences should be highlighted with colours, e.g. deleted lines in red and added lines in green
It should ignore space differences
It should be an open-source tool

use tkdiff4 -w file-name1 file-name2
It fulfills all your requirements. Specific color might be an issue.

try colordiff and man diff for options for ignoring whitespace etc

For example:
#!/bin/bash
wdiff -w "\e[31m" -x "\e[0m" -y "\e[32m" -z "\e[0m" "$@";
Replace \e with the ASCII escape character (value 0x1B). Put the two lines into a file, and run it with the output redirected to the file where you want the result saved.
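For comparison, here is a rough Python sketch of the same idea: a word-level diff that ignores whitespace differences and wraps deletions in red and insertions in green using ANSI escape sequences (ESC is "\x1b"). It is not a reimplementation of wdiff; color_word_diff is a made-up name, and line breaks are not preserved because the text is split on any whitespace.

```python
import difflib

RED, GREEN, RESET = "\x1b[31m", "\x1b[32m", "\x1b[0m"


def color_word_diff(old, new):
    # Splitting on whitespace makes runs of spaces, tabs and newlines irrelevant.
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        if tag in ("delete", "replace"):
            out.append(RED + " ".join(a[i1:i2]) + RESET)
        if tag in ("insert", "replace"):
            out.append(GREEN + " ".join(b[j1:j2]) + RESET)
    return " ".join(out)
```

Writing the returned string to a file keeps the escape codes, so viewing the file with cat or less -R in a terminal shows the colours.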

Save the changes to a file:
diff -Nur originalfile newfile > patchfile
Use the difference file to update the original file:
patch originalfile patchfile
I think this is the easiest way to save the changes and reapply them.
By the way, you can use this command to create an update package.
