Handling expanded git commands with python subprocess module - python-3.x

I'm trying to retrieve and work with data from historical versions of files in a git repo. I'd like to have something like a dictionary that holds <hash>, <time of commit>, <value retrieved from contents of a file revision>, <commit message> for each entry.
I figured the data I retrieve from each file revision, and any calculations done with it, would be best handled in Python, and the subprocess module appeared to be the best fit for integrating my git commands.
Below I show how I'm defining a function getval(key, filename) that I had hoped would print <SHA-1 hash>:<Value> to the console; ultimately, though, I'd like a dict with more info, including <time> and <commit message>.
I help operate an ion accelerator, where we store 'savesets'--values relevant to a given accelerator tune--using git. Among the values in these files are things like charge (Q) and mass (A). Ultimately, I want to retrieve both values, compute the ratio (Q/A), and display a list of file revision hashes sorted by the charge:mass ratio of the ion we delivered with the settings in that file's revision.
Sample of file (for 56Fe17+):
# Date: 2018-12-21 01:49:16.888
PV,SELECTED,TIMESTAMP,STATUS,SEVERITY,VALUE_TYPE,VALUE,READBACK,READBACK_VALUE,DELTA,READ_ONLY
REA_EXP:LINE,0,1544047322.881066957,NO_ALARM,NONE,enum,"JENSA~[UDF;AT-TPC;GPL;JENSA]",,"---",,true
REA_BTS19:BEAM:OPTICSFILE,0,1541798820.065952460,NO_ALARM,NONE,string,"BTS19_test3.data",,"---",,true
REA_BTS19:BEAM:A_BOOK,0,1545322510.562031883,NO_ALARM,NONE,double,"56.0",,"---",,true
REA_BTS19:BEAM:Z_BOOK,0,1545322567.544226340,NO_ALARM,NONE,double,"26.0",,"---",,true
REA_BTS19:BEAM:Q_BOOK,0,1545322512.701768974,NO_ALARM,NONE,double,"17.0",,"---",,true
So far--and with the help of others here--I've figured out a git one-liner that greps the revision history of a given file for a key (a string) and uses sed and awk to output <hash>:<value associated with the key>.
Git Oneliner I'm Starting with:
git grep 'BTS19:BEAM:A_BOOK' $(git rev-list --all) -- ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp | sed 's/:/,/' | awk -F, '{print $1 ":" $8}'
Oneliner's Output
e78f73fe6f90e93d5b3ccf90975b0e540d12ce09:"56.0"
4b94745bd0a6594bb42a774c95b5fc0847ef2d82:"56.0"
f2d5e263deac1d9112be791b39f4ce1b1b34e55d:"56.0"
c03800de52143ddb2abfab51fcc665ff5470e363:"56.0"
4a3a564a6d87bc6ff5f3dc7fec7670aeecfe6a79:"58.0"
d591941e51c4eab1237ce726a2a49448114b8f26:"58.0"
a9c8f5cdf224ff4fd94514c33888796760afd792:"58.0"
2f221492beea1663216dcfb27da89343817b11fd:"58.0"
I've also started playing with the subprocess Python module, but I'm struggling to figure out how to handle my more complicated git commands. Generally, I'll want to be able to pass a key and a file--something like getval(key, filename).
When my cmd string was ['git', 'grep', str, '$(git rev-list --all)', '--', pathspec], it returned errors stating that '$(git rev-list --all)' was ambiguous. Thinking it wasn't being expanded, I added a separate process to execute the nested command, but I'm not sure I'm doing this correctly.
My Python file (gitfun.py), which I'm currently running the function from:
import sys, os
import subprocess
def getval(str, pathspec, repoDir='/mnt/d/stash.projects/rea'):
    p1 = subprocess.Popen(["git", "rev-list", "--all"], stdout=subprocess.PIPE)
    output, err = p1.communicate()
    cmd = ['git', 'grep', str, output, '--', pathspec]
    p2 = subprocess.Popen(cmd, cwd=repoDir)
    p2.wait()

cwd = '/mnt/d/stash.projects/rea'
filename = 'ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp'
os.chdir(cwd)
getval('BTS19:BEAM:A_BOOK', filename)
Currently it returns 'file name too long', so (even though I'm not convinced it really is too long) I tried setting core.longpaths to true in my git config, but this had no effect. This is again why I suspect I'm not handling my replacement of the $(git rev-list --all) expansion correctly.
For this code, I expect something that looks like this:
522628b8d3db01ac330240b28935933b0448649c:ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp:REA_BTS19:BEAM:A_BOOK,0,1545240215.743201855,NO_ALARM,NONE,double,"58.0",,"---",,true
2557c599d2dc67d80ffc5b9be3f79899e0c15a10:ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp:REA_BTS19:BEAM:A_BOOK,0,1545240215.743201855,NO_ALARM,NONE,double,"58.0",,"---",,true
7fc97ec2aa76f32265196c42dbcd289c49f0ad93:ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp:REA_BTS19:BEAM:A_BOOK,0,1545240215.743201855,NO_ALARM,NONE,double,"58.0",,"---",,true
...
But I ultimately want an output to console that looks identical to the git one-liner above, or better yet, a dict that I can print to console or do other things with.

Remember that your shell tokenizes the command line using white space.
When you run git rev-list --all, you get output like:
2a4be2748fad885f88163a5b9b1b438fe3cb2ece
c1a30c743eb810fbefe1dc314277931fa33842b3
b2e5c75131e94a3543e5dcf9fb641ccd553906b4
95718f7e128a8b36ca93d6589328cc5b739668b1
87a9ada188a8cd1c13e48c21f093be7027d61eca
When you substitute that into your git grep command...
git grep 'BTS19:BEAM:A_BOOK' $(git rev-list --all) -- \
ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp
...each line is a separate argument. That is, if the output of git rev-list --all was exactly what I've shown above, then your one-liner would be tokenized into the following arguments, which I have listed one per line for clarity:
git
grep
BTS19:BEAM:A_BOOK
2a4be2748fad885f88163a5b9b1b438fe3cb2ece
c1a30c743eb810fbefe1dc314277931fa33842b3
b2e5c75131e94a3543e5dcf9fb641ccd553906b4
95718f7e128a8b36ca93d6589328cc5b739668b1
87a9ada188a8cd1c13e48c21f093be7027d61eca
--
ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp
But you're not doing this in your Python code! You're passing the entire output of git rev-list --all as a single argument. That means the command you're trying to execute has a fixed number (6) of arguments:
git
grep
BTS19:BEAM:A_BOOK
2a4be2748fad885f88163a5b9b1b438fe3cb2ece c1a30c743eb810fbefe1dc314277931fa33842b3 b2e5c75131e94a3543e5dcf9fb641ccd553906b4 95718f7e128a8b36ca93d6589328cc5b739668b1 87a9ada188a8cd1c13e48c21f093be7027d61eca
--
ReAccelerator/Snapshots/RFQ-JENSA_Setpoints.snp
All those revisions are getting bundled together in a single argument, which is where the "filename too long" error comes from. You need to split that output into multiple arguments just like the shell does:
p1 = subprocess.Popen(["git", "rev-list", "--all"], stdout=subprocess.PIPE)
output, err = p1.communicate()
cmd = ['git', 'grep', str] + output.splitlines() + ['--', pathspec]
p2 = subprocess.Popen(cmd, cwd=repoDir)
p2.wait()
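Building on that fix, here is a hedged sketch (my own, not a drop-in replacement for the original getval) of how the same idea could return the dict the question asks for, with hash, commit time, value, and commit message per revision. It assumes Python 3.7+ for capture_output, the repository path from the question, and that the value sits in the VALUE column, i.e. the seventh comma-separated field of the matched line (awk's $8 above, once the hash is dropped):

import subprocess

def getval_dict(key, pathspec, repo_dir='/mnt/d/stash.projects/rea'):
    # All revisions in the repository, one hash per list element
    revs = subprocess.run(['git', 'rev-list', '--all'],
                          cwd=repo_dir, capture_output=True, text=True,
                          check=True).stdout.split()
    # Same search as the one-liner: git grep <key> <rev>... -- <pathspec>
    grep = subprocess.run(['git', 'grep', key] + revs + ['--', pathspec],
                          cwd=repo_dir, capture_output=True, text=True)
    results = {}
    for line in grep.stdout.splitlines():
        # Each match looks like <hash>:<path>:<matched line>
        sha, _path, match = line.split(':', 2)
        value = match.split(',')[6].strip('"')
        # Commit date (ISO) and subject line for this revision
        meta = subprocess.run(['git', 'show', '-s', '--format=%ci%n%s', sha],
                              cwd=repo_dir, capture_output=True, text=True,
                              check=True).stdout.splitlines()
        results[sha] = {'time': meta[0],
                        'value': value,
                        'message': meta[1] if len(meta) > 1 else ''}
    return results

Printing the same output as the one-liner would then just be a loop over results.items(), e.g. print('{}:{}'.format(sha, info['value'])).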

Related

How to execute svn command along with grep on windows?

I'm trying to execute an svn command on a Windows machine and capture its output.
Code:
import subprocess
cmd = "svn log -l1 https://repo/path/trunk | grep ^r | awk '{print \$3}'"
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
'grep' is not recognized as an internal or external command,
operable program or batch file.
I do understand that 'grep' is not a Windows utility; that is why I'm getting the error "'grep' is not recognized as an internal or external command, operable program or batch file."
Is this limited to running on Linux only?
Can we execute the same on Windows?
Is my code right?
For Windows, your command will look something like the following:
svn log -l1 https://repo/path/trunk | find "string_to_find"
You need to use the find utility on Windows to get the same effect as grep.
svn --version | find "ra"
* ra_svn : Module for accessing a repository using the svn network protocol.
* ra_local : Module for accessing a repository on local disk.
* ra_serf : Module for accessing a repository via WebDAV protocol using serf.
Use svn log --search FOO instead of grep-ing the command's output.
grep and awk are certainly available for Windows as well, but there is really no need to install them -- the code is easy to replace with native Python.
import subprocess

p = subprocess.run(["svn", "log", "-l1", "https://repo/path/trunk"],
                   capture_output=True, text=True)
for line in p.stdout.splitlines():
    # grep ^r
    if line.startswith('r'):
        # awk '{ print $3 }'
        print(line.split()[2])
Because we don't need a pipeline, and just run a single static command, we can avoid shell=True.
Because we don't want to do the necessary plumbing (which you forgot anyway) for Popen(), we prefer subprocess.run(). With capture_output=True we conveniently get its output in the resulting object's stdout attribute; because we expect text output, we pass text=True (in older Python versions you might need to switch to the old, slightly misleading synonym universal_newlines=True).
I guess the intent is to search for the committer in each revision's output, but this will incorrectly grab the third token on any line which starts with an r (so if you have a commit message like "refactored to use Python native code" the code will extract use from that). A better approach altogether is to request machine-readable output from svn and parse that (but it's unfortunately rather clunky XML, so there's another not entirely trivial rabbithole for you). Perhaps as middle ground implement a more specific pattern for finding those lines -- maybe look for a specific number of fields, and static strings where you know where to expect them.
if line.startswith('r'):
    fields = line.split()
    # the sample header shown below splits into 14 whitespace-separated fields
    if len(fields) == 14 and fields[1] == '|' and fields[3] == '|':
        print(fields[2])
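If you do want to go the machine-readable route mentioned above, here is a hedged sketch using svn log --xml and the standard library XML parser; the element names reflect svn's usual XML log format, but double-check them against your server's output:

import subprocess
import xml.etree.ElementTree as ET

# --xml asks svn for machine-readable output; parse it instead of scraping text
p = subprocess.run(["svn", "log", "-l1", "--xml", "https://repo/path/trunk"],
                   capture_output=True, text=True)
root = ET.fromstring(p.stdout)
for entry in root.iter("logentry"):
    # <logentry revision="..."><author>...</author><date>...</date><msg>...</msg></logentry>
    print(entry.findtext("author"))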
You could also craft a regular expression to look for a date stamp in the third |-separated field, and the number of changed lines in the fourth.
For the record, a complete commit message from Subversion looks like
------------------------------------------------------------------------
r16110 | tripleee | 2020-10-09 10:41:13 +0300 (Fri, 09 Oct 2020) | 4 lines
refactored to use native Python instead of grep + awk
(which is a useless use of grep anyway; see http://www.iki.fi/era/unix/award.html#grep)
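And a hedged sketch of the regular-expression variant, written against the sample header line above (the pattern is my own and only as good as that sample):

import re

# r16110 | tripleee | 2020-10-09 10:41:13 +0300 (Fri, 09 Oct 2020) | 4 lines
LOG_HEADER = re.compile(
    r'^r\d+ \| (?P<author>\S+) \| '
    r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4} \([^)]*\) \| \d+ lines?$')

for line in p.stdout.splitlines():   # p from the subprocess.run() call above
    m = LOG_HEADER.match(line)
    if m:
        print(m.group('author'))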

Is there a way to know how the user invoked a program from bash?

Here's the problem: I have this script foo.py, and if the user invokes it without the --bar option, I'd like to display the following error message:
Please add the --bar option to your command, like so:
python foo.py --bar
Now, the tricky part is that there are several ways the user might have invoked the command:
They may have used python foo.py like in the example
They may have used /usr/bin/foo.py
They may have a shell alias frob='python foo.py', and actually ran frob
Maybe it's even a git alias flab=!/usr/bin/foo.py, and they used git flab
In every case, I'd like the message to reflect how the user invoked the command, so that the example I'm providing would make sense.
sys.argv always contains foo.py, and /proc/$$/cmdline doesn't know about aliases. It seems to me that the only possible source for this information would be bash itself, but I don't know how to ask it.
Any ideas?
UPDATE How about if we limit possible scenarios to only those listed above?
UPDATE 2: Plenty of people wrote very good explanation about why this is not possible in the general case, so I would like to limit my question to this:
Under the following assumptions:
The script was started interactively, from bash
The script was started in one of these 3 ways:
foo <args> where foo is a symbolic link /usr/bin/foo -> foo.py
git foo where alias.foo=!/usr/bin/foo in ~/.gitconfig
git baz where alias.baz=!/usr/bin/foo in ~/.gitconfig
Is there a way to distinguish between 1 and (2,3) from within the script? Is there a way to distinguish between 2 and 3 from within the script?
I know this is a long shot, so I'm accepting Charles Duffy's answer for now.
UPDATE 3: So far, the most promising angle was suggested by Charles Duffy in the comments below. If I can get my users to have
trap 'export LAST_BASH_COMMAND=$(history 1)' DEBUG
in their .bashrc, then I can use something like this in my code:
import os

like_so = None
cmd = os.environ.get('LAST_BASH_COMMAND')
if cmd is not None:
    cmd = cmd[8:]  # Remove the history counter
    if cmd.startswith("foo "):
        like_so = "foo --bar " + cmd[4:]
    elif cmd.startswith(r"git foo "):
        like_so = "git foo --bar " + cmd[8:]
    elif cmd.startswith(r"git baz "):
        like_so = "git baz --bar " + cmd[8:]
if like_so is not None:
    print("Please add the --bar option to your command, like so:")
    print(" " + like_so)
else:
    print("Please add the --bar option to your command.")
This way, I show the general message if I don't manage to get their invocation method. Of course, if I'm going to rely on changing my users' environment I might as well ensure that the various aliases export their own environment variables that I can look at, but at least this way allows me to use the same technique for any other script I might add later.
No, there is no way to see the original text (before aliases/functions/etc).
Starting a program in UNIX is done as follows at the underlying syscall level:
int execve(const char *path, char *const argv[], char *const envp[]);
Notably, there are three arguments:
The path to the executable
An argv array (the first item of which -- argv[0] or $0 -- is passed to that executable to reflect the name under which it was started)
A list of environment variables
Nowhere in here is there a string that provides the original user-entered shell command from which the new process's invocation was requested. This is particularly true since not all programs are started from a shell at all; consider the case where your program is started from another Python script with shell=False.
It's completely conventional on UNIX to assume that your program was started through whatever name is given in argv[0]; this works for symlinks.
You can even see standard UNIX tools doing this:
$ ls '*.txt' # sample command to generate an error message; note "ls:" at the front
ls: *.txt: No such file or directory
$ (exec -a foobar ls '*.txt') # again, but tell it that its name is "foobar"
foobar: *.txt: No such file or directory
$ alias somesuch=ls # this **doesn't** happen with an alias
$ somesuch '*.txt' # ...the program still sees its real name, not the alias!
ls: *.txt: No such file or directory
If you do want to generate a UNIX command line, use pipes.quote() (Python 2) or shlex.quote() (Python 3) to do it safely.
try:
    from pipes import quote  # Python 2
except ImportError:
    from shlex import quote  # Python 3

cmd = ' '.join(quote(s) for s in open('/proc/self/cmdline', 'r').read().split('\0')[:-1])
print("We were called as: {}".format(cmd))
Again, this won't "un-expand" aliases, revert to the code that was invoked to call a function that invoked your command, etc; there is no un-ringing that bell.
That can be used to look for a git instance in your parent process tree, and discover its argument list:
import re

def find_cmdline(pid):
    return open('/proc/%d/cmdline' % (pid,), 'r').read().split('\0')[:-1]

def find_ppid(pid):
    stat_data = open('/proc/%d/stat' % (pid,), 'r').read()
    stat_data_sanitized = re.sub('[(]([^)]+)[)]', '_', stat_data)
    return int(stat_data_sanitized.split(' ')[3])

def all_parent_cmdlines(pid):
    while pid > 0:
        yield find_cmdline(pid)
        pid = find_ppid(pid)

def find_git_parent(pid):
    for cmdline in all_parent_cmdlines(pid):
        if cmdline[0] == 'git':
            return ' '.join(quote(s) for s in cmdline)
    return None
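A small usage illustration (my own, assuming the helpers above are in scope and a Linux-style /proc):

import os

# Walk up from our own PID; prints the reconstructed git command line, if any.
git_cmdline = find_git_parent(os.getpid())
if git_cmdline is not None:
    print("Looks like we were invoked via: " + git_cmdline)
else:
    print("No git process found among our ancestors.")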
See the Note at the bottom regarding the originally proposed wrapper script.
A new more flexible approach is for the python script to provide a new command line option, permitting users to specify a custom string they would prefer to see in error messages.
For example, if a user prefers to call the python script 'myPyScript.py' via an alias, they can change the alias definition from this:
alias myAlias='myPyScript.py $@'
to this:
alias myAlias='myPyScript.py --caller=myAlias $@'
If they prefer to call the python script from a shell script, it can use the additional command line option like so:
#!/bin/bash
exec myPyScript.py "$@" --caller=${0##*/}
Other possible applications of this approach:
bash -c myPyScript.py --caller="bash -c myPyScript.py"
myPyScript.py --caller=myPyScript.py
For listing expanded command lines, here's a script 'pyTest.py', based on feedback by @CharlesDuffy, that lists the cmdline for the running python script, as well as the parent process that spawned it.
If the new --caller argument is used, it will appear in the command line, although aliases will have been expanded, etc.
#!/usr/bin/env python
import os, re

with open("/proc/self/stat", "r") as myfile:
    data = [x.strip() for x in str.split(myfile.readlines()[0], ' ')]

pid = data[0]
ppid = data[3]

def commandLine(pid):
    with open("/proc/" + pid + "/cmdline", "r") as myfile:
        return [x.strip() for x in str.split(myfile.readlines()[0], '\x00')][0:-1]

pid_cmdline = commandLine(pid)
ppid_cmdline = commandLine(ppid)
print "%r" % pid_cmdline
print "%r" % ppid_cmdline
After saving this to a file named 'pytest.py', and then calling it from a bash script called 'pytest.sh' with various arguments, here's the output:
$ ./pytest.sh a b "c d" e
['python', './pytest.py']
['/bin/bash', './pytest.sh', 'a', 'b', 'c d', 'e']
NOTE: criticisms of the original wrapper script aliasTest.sh were valid. Although the existence of a pre-defined alias is part of the specification of the question, and may be presumed to exist in the user environment, the proposal defined the alias (creating the misleading impression that it was part of the recommendation rather than a specified part of the user's environment), and it didn't show how the wrapper would communicate with the called python script. In practice, the user would either have to source the wrapper or define the alias within the wrapper, and the python script would have to delegate the printing of error messages to multiple custom calling scripts (where the calling information resided), and clients would have to call the wrapper scripts. Solving those problems led to a simpler approach, that is expandable to any number of additional use cases.
Here's a less confusing version of the original script, for reference:
#!/bin/bash
shopt -s expand_aliases
alias myAlias='myPyScript.py'

# called like this:
set -o history
myAlias $@
_EXITCODE=$?
CALL_HISTORY=( `history` )
_CALLING_MODE=${CALL_HISTORY[1]}

case "$_EXITCODE" in
    0)  # no error message required
        ;;
    1)
        echo "customized error message #1 [$_CALLING_MODE]" 1>&2
        ;;
    2)
        echo "customized error message #2 [$_CALLING_MODE]" 1>&2
        ;;
esac
Here's the output:
$ aliasTest.sh 1 2 3
['./myPyScript.py', '1', '2', '3']
customized error message #2 [myAlias]
There is no way to distinguish between when an interpreter for a script is explicitly specified on the command line and when it is deduced by the OS from the hashbang line.
Proof:
$ cat test.sh
#!/usr/bin/env bash
ps -o command $$
$ bash ./test.sh
COMMAND
bash ./test.sh
$ ./test.sh
COMMAND
bash ./test.sh
This prevents you from detecting the difference between the first two cases in your list.
I am also confident that there is no reasonable way of identifying the other (mediated) ways of calling a command.
I can see two ways to do this:
The simplest, as suggested by 3sky, would be to parse the command line from inside the python script. argparse can be used to do so in a reliable way. This only works if you can change that script.
A more complex way, slightly more generic and involved, would be to change the python executable on your system.
Since the first option is well documented, here are a bit more details on the second one:
Regardless of the way your script is called, python is run. The goal here is to replace the python executable with a script that checks whether foo.py is among the arguments, and if it is, checks whether --bar is as well. If not, it prints the message and returns.
In every other case, it executes the real python executable.
Now, hopefully, running python is done through the shebang #!/usr/bin/env python3, or through python foo.py, rather than a variant of #!/usr/bin/python or /usr/bin/python foo.py. That way, you can change the $PATH variable and prepend a directory where your fake python resides.
In the other case, you can replace the /usr/bin/python executable, at the risk of not playing nice with updates.
A more complex way of doing this would probably be with namespaces and mounts, but the above is probably enough, especially if you have admin rights.
Example of what could work as a script:
#!/usr/bin/env bash

function checkbar
{
    for i in "$@"
    do
        if [ "$i" = "--bar" ]
        then
            echo "Well done, you added --bar!"
            return 0
        fi
    done
    return 1
}

command=$(basename ${1:-none})
if [ $command = "foo.py" ]
then
    if ! checkbar "$@"
    then
        echo "Please add --bar to the command line, like so:"
        printf "%q " $0
        printf "%q " "$@"
        printf -- "--bar\n"
        exit 1
    fi
fi
/path/to/real/python "$@"
However, after re-reading your question, it is obvious that I misunderstood it. In my opinion, it is all right to just print "foo.py must be called like foo.py --bar", "please add --bar to your arguments", or "please try (instead of )", regardless of what the user entered:
If it's a (git) alias, this is a one-time error; the user will try their alias right after creating it, so they know where to put the --bar part.
With either /usr/bin/foo.py or python foo.py:
If the user is not really command line-savvy, they can just paste the working command that is displayed, even if they don't know the difference
If they are, they should be able to understand the message without trouble, and adjust their command line.
I know this is a bash task, but I think the easiest way is to modify 'foo.py'. Of course it depends on how complicated the script is, but maybe it will fit. Here is sample code:
#!/usr/bin/python
import sys

if len(sys.argv) > 1 and sys.argv[1] == '--bar':
    print 'make magic'
else:
    print 'Please add the --bar option to your command, like so:'
    print '  python foo.py --bar'
In this case, it does not matter how the user runs the code.
$ ./a.py
Please add the --bar option to your command, like so:
python foo.py --bar
$ ./a.py -dua
Please add the --bar option to your command, like so:
python foo.py --bar
$ ./a.py --bar
make magic
$ python a.py --t
Please add the --bar option to your command, like so:
python foo.py --bar
$ /home/3sky/test/a.py
Please add the --bar option to your command, like so:
python foo.py --bar
$ alias a='python a.py'
$ a
Please add the --bar option to your command, like so:
python foo.py --bar
$ a --bar
make magic

How can I use a variable in the path for the source and destination directories using Groovy

The git1 repository has a folder named sports, which has further subfolders and files inside it. I want to copy the file given by the output of the git diff in command2.
def command2 = "git diff --stat @{12.hours.ago}"
Process process = command2.execute(null, new File('C:/git1'))
def b=process.text
println b
The output in the Groovy console is Sports/Cricket/Players/Virat.txt.
Now I want to use this file path in sourceDir, as shown below.
def sourceDir = "C:/git1/$b"
def destinationDir = "D:/git1/$b"
(new AntBuilder()).copy(file: sourceDir, tofile: destinationDir)
This is giving the error: Warning: Could not find file C:\git1\Sports\Cricket\Players\Virat.txt to copy.
As you can see in your comment with the byte array, the last character of the string is a newline character (10). This of course makes the path invalid, but it is not easily recognizable in the error message. Add a .trim() after the .text and it should work with both AntBuilder and Files, and on both Windows and Linux. If you had done this on Windows, there would additionally be a 13 (carriage return) before the 10.
Another, maybe even better, approach would be not to use git diff the way you do, as it is a porcelain command. Porcelain commands, as their name suggests, are not stable; their arguments and output can change at any time. They are meant for human consumption, not for scripts. Scripts should use plumbing commands, which are much more stable in input and output. With the respective plumbing command you might not have had this problem.

How can I set up an alias for "git last" that accepts a number as an argument?

In Pro Git, Scott Chacon gives some nice examples of aliases which might be helpful, including one that shows the last commit: git last, which is the equivalent of log -1 HEAD:
git config --global alias.last 'log -1 HEAD'
It will show something like this:
$ git last
commit 66938dae3329c7aebe598c2246a8e6af90d04646
Author: Josh Goebel <dreamer3@example.com>
Date: Tue Aug 26 19:48:51 2008 +0800
test for current head
I read a few similar questions on Stack Overflow, like Pass an argument to a Git alias command, but have still not been able to figure this out.
The best I could come up with was to modify my .gitconfig file as follows:
[alias]
last = log -1 HEAD
mylast = "!sh -c 'echo /usr/local/bin/git last -$0 HEAD'"
Then if I run this at the command line:
$ git mylast 12
I get this:
/usr/local/bin/git last -12 HEAD
That actually looks right. But if I remove the echo in front, it just hangs like it is waiting for input. I tried switching $0 for $1 but that didn't seem to help either.
What am I doing wrong?
Also, is there a way to set it up so that if I just type git last with no number then it would default to "1" ?
last = !sh -c 'git log "-${1:-1}" HEAD' -
This takes advantage of the shell's parameter-expansion default syntax, ${var:-default}, which substitutes default if the variable var is unset or empty.

Using 'diff' (or anything else) to get character-level diff between text files

I'd like to use 'diff' to get both a line difference and a character difference between two files.
For example, consider:
File 1
abcde
abc
abcccd
File 2
abcde
ab
abccc
Using diff -u I get:
@@ -1,3 +1,3 @@
abcde
-abc
-abcccd
\ No newline at end of file
+ab
+abccc
\ No newline at end of file
However, it only shows me that were changes in these lines. What I'd like to see is something like:
@@ -1,3 +1,3 @@
abcde
-ab<ins>c</ins>
-abccc<ins>d</ins>
\ No newline at end of file
+ab
+abccc
\ No newline at end of file
You get my drift.
Now, I know I can use other engines to mark/check the difference on a specific line. But I'd rather use one tool that does all of it.
Git has a word diff, and defining all characters as words effectively gives you a character diff. However, newline changes are ignored.
Example
Create a repository like this:
mkdir chardifftest
cd chardifftest
git init
echo -e 'foobarbaz\ncatdog\nfox' > file
git add -A; git commit -m 1
echo -e 'fuobArbas\ncat\ndogfox' > file
git add -A; git commit -m 2
Now, run git diff --word-diff=color --word-diff-regex=. master^ master and you'll get colored, character-level output.
Note how both additions and deletions are recognized at the character level, while additions and deletions of newlines are ignored.
You may also want to try one of these:
git diff --word-diff=plain --word-diff-regex=. master^ master
git diff --word-diff=porcelain --word-diff-regex=. master^ master
You can use:
diff -u f1 f2 |colordiff |diff-highlight
colordiff is an Ubuntu package. You can install it using sudo apt-get install colordiff.
diff-highlight is from git (since version 2.9). It is located in /usr/share/doc/git/contrib/diff-highlight/diff-highlight. You can put it somewhere in your $PATH.
Python's difflib is ace if you want to do this programmatically. For interactive use, I use vim's diff mode (easy enough to use: just invoke vim with vimdiff a b). I also occasionally use Beyond Compare, which does pretty much everything you could hope for from a diff tool.
I haven't seen any command-line tool which does this usefully, but as Will notes, the difflib example code might help.
You can use the cmp command in Solaris:
cmp compares two files and, if they differ, tells you the first byte and line number where they differ.
Python has convenient library named difflib which might help answer your question.
Below are two oneliners using difflib for different python versions.
python3 -c 'import difflib, sys; \
print("".join( \
difflib.ndiff( \
open(sys.argv[1]).readlines(),open(sys.argv[2]).readlines())))'
python2 -c 'import difflib, sys; \
print "".join( \
difflib.ndiff( \
open(sys.argv[1]).readlines(), open(sys.argv[2]).readlines()))'
These might come in handy as a shell alias, which is easy to carry around in your .${SHELL_NAME}rc.
$ alias char_diff="python2 -c 'import difflib, sys; print \"\".join(difflib.ndiff(open(sys.argv[1]).readlines(), open(sys.argv[2]).readlines()))'"
$ char_diff old_file new_file
And more readable version to put in a standalone file.
#!/usr/bin/env python2
from __future__ import with_statement
import difflib
import sys

with open(sys.argv[1]) as old_f, open(sys.argv[2]) as new_f:
    old_lines, new_lines = old_f.readlines(), new_f.readlines()

diff = difflib.ndiff(old_lines, new_lines)
print ''.join(diff)
Coloured, character-level diff output
Here's what you can do with the below script and diff-highlight (which is part of git):
#!/bin/sh -eu

# Use diff-highlight to show word-level differences
diff -U3 --minimal "$@" |
    sed 's/^-/\x1b[1;31m-/;s/^+/\x1b[1;32m+/;s/^@/\x1b[1;34m@/;s/$/\x1b[0m/' |
    diff-highlight
(Credit to @retracile's answer for the sed highlighting)
cmp -l file1 file2 | wc
Worked well for me. The leftmost number of the result indicates the number of characters that differ.
I also wrote my own script to solve this problem using the Longest common subsequence algorithm.
It is executed as such
JLDiff.py a.txt b.txt out.html
The result is in HTML with red and green coloring. Larger files do take exponentially longer to process, but this does a true character-by-character comparison without checking line by line first.
Python's difflib can do this.
The documentation includes an example command-line program for you.
The exact format is not as you specified, but it would be straightforward to either parse the ndiff-style output or to modify the example program to generate your notation.
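For illustration, here is a hedged sketch (my own, not the difflib documentation's example) that pairs up lines from two files and uses difflib.SequenceMatcher to mark character-level edits in an <ins>/<del>-style notation similar to the question's; a real tool would align changed lines first rather than pairing them naively:

import difflib
import sys

def char_diff(old_line, new_line):
    # Mark the character-level edits needed to turn old_line into new_line.
    out_old, out_new = [], []
    sm = difflib.SequenceMatcher(None, old_line, new_line)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == 'equal':
            out_old.append(old_line[i1:i2])
            out_new.append(new_line[j1:j2])
        else:
            if i1 != i2:
                out_old.append('<del>' + old_line[i1:i2] + '</del>')
            if j1 != j2:
                out_new.append('<ins>' + new_line[j1:j2] + '</ins>')
    return ''.join(out_old), ''.join(out_new)

with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
    for old_line, new_line in zip(f1, f2):   # naive line pairing
        old_line, new_line = old_line.rstrip('\n'), new_line.rstrip('\n')
        if old_line != new_line:
            marked_old, marked_new = char_diff(old_line, new_line)
            print('-' + marked_old)
            print('+' + marked_new)
        else:
            print(' ' + old_line)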
As one comment on the main answer said, you don't have to commit to use git diff:
git diff --word-diff=color --word-diff-regex=. file1 file2
Green marks characters added in the second file.
Red marks characters that appear only in the first file (i.e. removed).
Here is an online text comparison tool:
http://text-compare.com/
It can highlight every single char that is different and continues comparing the rest.
ccdiff is a convenient dedicated tool for the task, which highlights the changed characters within each changed line of your example.
By default, it highlights the differences in color, but it can be used in a console without color support too.
The package is included in the main repository of Debian:
ccdiff is a colored diff that also colors inside changed lines.
All command-line tools that show the difference between two files fall short of showing minor changes in a visually useful way. ccdiff tries to give the look and feel of diff --color or colordiff, extending the colored output from whole deleted and added lines to the deleted and added characters within the changed lines.
Not a complete answer, but if cmp -l's output is not clear enough, you can use:
sed 's/\(.\)/\1\n/g' file1 > file1.vertical
sed 's/\(.\)/\1\n/g' file2 > file2.vertical
diff file1.vertical file2.vertical
If you keep your files in Git, you can diff between versions with the diff-highlight script, which will show different lines, with differences highlighted.
Unfortunately it only works when the number of lines removed matches the number of lines added - there is stub code for when lines don't match, so presumably this could be fixed in the future.
I think a simple solution is always a good solution. In my case, the code below helps me a lot. I hope it helps somebody else.
#!/bin/env python
import sys

def readfile(fileName):
    f = open(fileName)
    c = f.read()
    f.close()
    return c

def diff(s1, s2):
    counter = 0
    for ch1, ch2 in zip(s1, s2):
        if not ch1 == ch2:
            break
        counter += 1
    # return the index of the first differing character, or -1 if the compared parts match
    if counter < len(s1):
        return counter
    return -1

f1 = readfile(sys.argv[1])
f2 = readfile(sys.argv[2])
pos = diff(f1, f2)
end = pos + 200
if pos >= 0:
    print "Different at:", pos
    print ">", f1[pos:end]
    print "<", f2[pos:end]
You can compare two files with the following syntax at your favorite terminal:
$ ./diff.py fileNumber1 fileNumber2
Most of these answers mention using diff-highlight, a Perl module. But I didn't want to figure out how to install a Perl module, so I made a few minor changes to it to make it a self-contained Perl script.
You can install it using:
▶ curl -o /usr/local/bin/DiffHighlight.pl \
https://raw.githubusercontent.com/alexharv074/scripts/master/DiffHighlight.pl
And the usage (if you have the Ubuntu colordiff mentioned in zhanxw's answer):
▶ diff -u f1 f2 | colordiff | DiffHighlight.pl
And the usage (if you don't):
▶ diff -u f1 f2 | DiffHighlight.pl
