diff line-format: Show delete, new, and changed lines? - linux

backup.txt
user1:password:17002:0:99:7:::
user2:password:17003:0:99:7:::
user3:password:17004:0:99:7:::
"main.txt" started out identical to "backup.txt". Then I renamed "user1", added a new user, and removed "user2" in "main.txt", so "main.txt" now looks like:
username1:password:17002:0:99:7:::
user3:password:17004:0:99:7:::
newUser:password:17005:0:99:7:::
After that I use the following command to compare the two files:
diff --unchanged-line-format="" --old-line-format=":%dn: %L" --new-line-format=":%dn: %L" backup.txt main.txt
...with the actual output:
:1: user1:password:17002:0:99:7:::
:2: user2:password:17003:0:99:7:::
:1: username1:password:17002:0:99:7:::
:3: newUser:password:17005:0:99:7:::
However, my intended output was:
:1c: user1:password:17002:0:99:7:::
:2d: user2:password:17003:0:99:7:::
:1c: username1:password:17002:0:99:7:::
:3a: newUser:password:17005:0:99:7:::
These marker characters (a, c, d) appear in the default "diff" output. How can I enable them in the line formats? Is it possible?

The LTYPEs offered by both BSD and GNU diff are "old", "new", and "unchanged". You thus can't distinguish between "new" and "changed".
That said, to get some distinction in your format strings, you need to fill them out correctly. In %dn, both the d and the n are consumed (the d specifies a decimal value; the n specifies that it refers to the line number, or the number of lines modified, depending on context). Thus, if you want any extra characters (such as a c, d or a), you need to add those characters after that substitution has completed.
# declaring functions to allow testing without creating files on-disk
backup () { printf '%s\n' user1:password:17002:0:99:7::: user2:password:17003:0:99:7::: user3:password:17004:0:99:7:::; }
main () { printf '%s\n' username1:password:17002:0:99:7::: user3:password:17004:0:99:7::: newUser:password:17005:0:99:7:::; }
diff \
--unchanged-line-format=":%dnu: %L" \
--old-line-format=":%dnd: %L" \
--new-line-format=":%dnn: %L" \
<(backup) <(main)
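If the a/c/d letters matter more than doing everything with format strings, one workaround is to post-process diff's default output, whose hunk headers (1,2c1, 3a3, ...) already carry those letters. The following is only a sketch, assuming a POSIX awk; the function name tagged_diff and the temporary file names are made up for illustration. Note that diff reports user2 as part of the 1,2c1 change hunk, so it is tagged c rather than d.

```shell
tagged_diff() {
    # diff exits with status 1 when the files differ, which is expected here
    { diff "$1" "$2" || true; } | awk '
    /^[0-9]/ {                                   # hunk header such as 1,2c1 or 3a3
        match($0, /[acd]/)
        type = substr($0, RSTART, 1)             # the a, c or d letter
        split(substr($0, 1, RSTART - 1), o, ",") # line range in the old file
        split(substr($0, RSTART + 1), n, ",")    # line range in the new file
        oldln = o[1]; newln = n[1]
        next
    }
    /^< / { printf ":%d%s: %s\n", oldln++, type, substr($0, 3) }
    /^> / { printf ":%d%s: %s\n", newln++, type, substr($0, 3) }'
}

# hypothetical sample files mirroring the question
dir=$(mktemp -d)
printf '%s\n' user1:password:17002:0:99:7::: user2:password:17003:0:99:7::: user3:password:17004:0:99:7::: > "$dir/backup.txt"
printf '%s\n' username1:password:17002:0:99:7::: user3:password:17004:0:99:7::: newUser:password:17005:0:99:7::: > "$dir/main.txt"
tagged=$(tagged_diff "$dir/backup.txt" "$dir/main.txt")
printf '%s\n' "$tagged"
rm -rf "$dir"
```

For the sample data this should print :1c: and :2c: for the two old lines, :1c: for username1 and :3a: for newUser.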

Related

Way to replace one variable with another in a string

I need to replace one variable with another variable in multiple strings.
For example:
string1="One,two"
string2="three.four"
string3="five:six"
y=";"
for str in string1 string2 string3; do
x="$(echo "$str" | sed 's/[a-zA-Z]//g')" # extracting a character between letters
sed 's/$x/$y/'$str # I tried this, but it does not work at all.
echo "$str"
done
Expecting output:
One;two
three;four
five;six
In my output, nothing changes:
One,two
three.four
five:six
You can use bash's substitution operator instead of sed. And simply replace anything that isn't a letter with $y.
#!/bin/bash
string1="One,two"
string2="three.four"
string3="five:six"
y=";"
for str in "$string1" "$string2" "$string3"; do
x=${str//[^a-zA-Z]/$y} # glob pattern, not a regex: a trailing + would be matched literally
echo "$x"
done
Output is:
One;two
three;four
five;six
Note that your general approach wouldn't work if the input string has multiple delimiters, e.g. One,two,three. When you remove all the letters you get ,, but that doesn't appear anywhere in the string.
Addressing issues with OP's current code:
referencing variables requires a leading $, preferably a pair of {}, and (usually) double quotes (eg, to ensure embedded spaces are considered as part of the variable's value)
sed can take as input a) a stream of text on stdin, b) a file, c) process substitution or d) a here-document/here-string
when building a sed script that includes variable references the sed script must be wrapped in double quotes (not single quotes)
Pulling all of this into OP's current code we get:
string1="One,two"
string2="three.four"
string3="five:six"
y=";"
for str in "${string1}" "${string2}" "${string3}"; do # proper references of the 3x "stringX" variables
x="$(echo "$str" | sed 's/[a-zA-Z]//g')"
sed "s/$x/$y/" <<< "${str}" # feeding "str" as here-string to sed; allowing variables "x/y" to be expanded in the sed script
echo "$str"
done
This generates:
One;two # generated by the 2nd sed call
One,two # generated by the echo
;hree.four # generated by the 2nd sed call
three.four # generated by the echo
five;six # generated by the 2nd sed call
five:six # generated by the echo
OK, so we're now getting some output but there are obviously some issues:
the results of the 2nd sed call are being sent to stdout/terminal as opposed to being captured in a variable (presumably the str variable - per the follow-on echo ???)
for string2 we find that x=. which when plugged into the 2nd sed call becomes sed "s/./;/"; from here the . matches the first character it finds which in this case is the 1st t in string2, so the output becomes ;hree.four (and the . is not replaced)
dynamically building sed scripts without knowing what's in x (and y) becomes tricky without some additional coding; instead it's typically easier to use parameter substitution to perform the replacements for us
in this particular case we can replace both sed calls with a single parameter substitution (which also eliminates the expensive overhead of two subprocesses for the $(echo ... | sed ...) call)
Making a few changes to OP's current code we can try:
string1="One,two"
string2="three.four"
string3="five:six"
y=";"
for str in "${string1}" "${string2}" "${string3}"; do
x="${str//[^a-zA-Z]/${y}}" # parameter substitution; replace everything *but* a letter with the contents of variable "y"
echo "${str} => ${x}" # display old and new strings
done
This generates:
One,two => One;two
three.four => three;four
five:six => five;six
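If sed has to be used with values that are only known at run time, the extracted delimiter (and the replacement) can be escaped before they are spliced into the sed script. This is a rough sketch of that idea rather than a complete solution; the helper names escape_bre, escape_repl and replace_delim are invented here, and it assumes basic regular expressions with / as the s command delimiter.

```shell
# Escape characters that are special in a basic regular expression
escape_bre() { printf '%s' "$1" | sed 's/[]\/$*.^[]/\\&/g'; }
# Escape characters that are special in the replacement part of s///
escape_repl() { printf '%s' "$1" | sed 's/[\/&]/\\&/g'; }

replace_delim() {
    # $1 = input string, $2 = replacement text
    # the delimiter is whatever remains once all letters are removed
    x=$(printf '%s' "$1" | sed 's/[a-zA-Z]//g')
    printf '%s\n' "$1" | sed "s/$(escape_bre "$x")/$(escape_repl "$2")/"
}

y=";"
for str in "One,two" "three.four" "five:six"; do
    replace_delim "$str" "$y"
done
```

With the escaping in place the . taken from three.four is matched literally instead of as "any character". The parameter substitution shown above remains the simpler choice.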

Use bash to find line in java files which include a pattern, and then replace another part of the line

I have a directory that includes a lot of java files, and in each file I have a class variable:
String system = "x";
I want to be able to create a bash script which I execute in the same directory, which will go through only the java files in the directory, and replace this instance of x with y. Here x and y are each a single word. Now this may not be the only instance of the word x in the Java file, however it will definitely be the first.
I want to be able to execute my script in the command line similar to:
changesystem.sh -x -y
This way I can specify what the x should be, and the y I wish to replace it with. I found a way to find and print the line number at which the first instance of a pattern is found:
awk '$0 ~ /String system/ {print NR}' file
I then found how to replace a substring on a given line using:
awk 'NR==line_number { sub("x", "y") }'
However, I have not found a way to combine them. Maybe there is also an easier way? Or even, a better and more efficient way?
Any help/advice will be greatly appreciated
You may create a changesystem.sh file with the following GNU awk script:
#!/bin/bash
for f in *.java; do
awk -i inplace -v repl="$1" '
!x && /^\s*String\s+system\s*=\s*".*";\s*$/{
lwsp=gensub(/\S.*/, "", 1);
print lwsp"String system = \""repl"\";";
x=1;next;
}1' "$f";
done;
Or, with any awk:
#!/bin/bash
for f in *.java; do
awk -v repl="$1" '
!x && /^[[:space:]]*String[[:space:]]+system[[:space:]]*=[[:space:]]*".*";[[:space:]]*$/{
lwsp=$0; sub(/[^[:space:]].*/, "", lwsp);
print lwsp"String system = \""repl"\";";
x=1;next
}1' "$f" > tmp && mv tmp "$f";
done;
Then, make the file executable:
chmod +x changesystem.sh
Then, run it like
./changesystem.sh 'new_value'
Notes:
for f in *.java; do ... done iterates over all *.java files in the current directory
-i inplace - GNU awk feature to perform replacement inline (not available in a non-GNU awk)
-v repl="$1" passes the first argument of the script to the awk command
!x && /^\s*String\s+system\s*=\s*".*";\s*$/ - if x is false and the record starts with any amount of whitespace (\s* or [[:space:]]*), then String, 1 or more whitespace characters, system, an = surrounded by zero or more whitespace characters, then a " char, any text, and a closing "; followed by zero or more whitespace characters, then
lwsp=gensub(/\S.*/, "", 1); puts the leading whitespace in the lwsp variable (it removes all text starting with the first non-whitespace char from the line matched)
lwsp=$0; sub(/[^[:space:]].*/, "", lwsp); - same as above, just in a different way since gensub is not supported in non-GNU awk and sub modifies the given input string (here, lwsp)
{print lwsp"String system = \""repl"\";";x=1;next}1 - prints the leading whitespace, then String system = " + the replacement string + ";, assigns 1 to x, and moves to the next line; for every other line, the trailing 1 just prints the line as is.
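To see the effect of the !x guard without touching real sources, the portable variant can be tried on a throwaway file. This is just an illustrative run: the Demo.java content and the variable names demo_dir and changed are invented, and the flag is called seen here instead of x.

```shell
demo_dir=$(mktemp -d)
printf '%s\n' 'class Demo {' '    String system = "x";' '    String label = "x";' '}' > "$demo_dir/Demo.java"

changed=$(awk -v repl="y" '
!seen && /^[[:space:]]*String[[:space:]]+system[[:space:]]*=[[:space:]]*".*";[[:space:]]*$/ {
    lwsp = $0; sub(/[^[:space:]].*/, "", lwsp)   # keep only the leading whitespace
    print lwsp "String system = \"" repl "\";"
    seen = 1; next                               # later matches are left alone
}
{ print }' "$demo_dir/Demo.java")

printf '%s\n' "$changed"
rm -rf "$demo_dir"
```

Only the String system line changes; the other "x" on the label line is left untouched, and the indentation is preserved.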
You don't need to pre-compute the line number. The whole job can be done by one not-too-complicated sed command. You probably do want to script it, though. For example:
#!/bin/bash
[[ $# -eq 3 ]] || {
echo "usage: $0 <context regex> <target regex> <replacement text>" 1>&2
exit 1
}
sed -si -e "/$1/ { s/\\<$2\\>/$3/; t1; p; d; :1; n; b1; }" ./*.java
That assumes that the files to modify are java source files in the current working directory, and I'm sure you understand the (loose) argument check and usage message.
As for the sed command itself,
the -s option instructs sed to treat each argument as a separate stream, instead of operating as if by concatenating all the inputs into one long stream.
the -i option instructs sed to modify the designated files in-place.
the sed expression takes the default action for each line (printing it verbatim) unless the line matches the "context" pattern given by the first script argument.
for lines that do match the context pattern,
s/\\<$2\\>/$3/ - attempt to perform the wanted substitution
the \< and \> match word start and end boundaries, respectively, so that the specified pattern will not match a partial word (though it can match multiple complete words if the target pattern allows)
t1 - if a substitution was made, then branch to label 1, otherwise
p; d - print the current line and immediately start the next cycle
:1; n; b1 - label 1 (reachable only by branching): print the current line and read the next one, then loop back to label 1. This prints the remainder of the file without any more tests or substitutions.
Example usage:
/path/to/replace_first.sh 'String system' x y
It is worth noting that this exposes the user to some details of sed's interpretation of regular expressions and replacement text, though that does not manifest for the example usage.
Note that the script could be simplified by removing the context pattern bit if you are sure you want to modify the overall first appearance of the target in each file. You could also hard-code the context, the target pattern, and/or the replacement text. If you hard-code all three then the script would no longer need any argument handling or checking.

Trailers option in git --pretty option

I was trying to extract a summary of contributions from git log and create a concise summary of that and create an excel/csv out of it to present reports.
I did try
git log --after="2020-12-10" --pretty=format:'"%h","%an","%ae","%aD","%s","(trailers:key="Reviewed By")"'
and the resulting CSV has a blank column at the end:
...
"7c87963cc","XYZ","xyz#abc.com","Tue Dec 8 17:40:13 2020 +0000","[TTI] Add support for target hook in compiler.", ""
...
and the git log looks something like
commit 7c87963cc
Author: XYZ <xyz#abc.com>
Date: Tue Dec 8 17:40:13 2020 +0000
[TTI] Add support for target hook in compiler.
This adds some code in the TabeleGen ...
This is my body of commit.
Reviewed By: Sushant
Differential Revision: https://codereviews.com/DD8822
What I couldn't manage was extracting the Differential Revision string using the (trailers:key="Reviewed By") placeholder.
I couldn't find much on how to get this working.
I checked the git manual and I did try what it explains.
Is there something I might be missing in this command?
The expected output should have the text
https://codereviews.com/DD8822 at the last position in the above CSV output.
I'm not sure but:
trailer keys cannot contain whitespace (therefore Reviewed By -> Reviewed-By, and Differential Revision -> Differential-Revision);
trailers should not be separated from each other by blank lines, but the trailer block itself must be separated from the rest of the commit message by a blank line (therefore Reviewed By from your question is not considered a trailer).
I would also not recommend using CSV, but using TSV instead: git output is not aware of CSV syntax (escaping of semicolons and commas), therefore the generated document may be unparsable.
If your commit messages would look like this (- instead of spaces, no new line delimiters):
commit 7c87963cc
Author: XYZ <xyz#abc.com>
Date: Tue Dec 8 17:40:13 2020 +0000
[TTI] Add support for target hook in compiler.
This adds some code in the TabeleGen ...
This is my body of commit.
Reviewed-By: Sushant
Differential-Revision: https://codereviews.com/DD8822
Then the following command would work for you:
git log --pretty=format:'%h%x09%an%x09%ae%x09%aD%x09%s%x09%(trailers:key=Reviewed-By,separator=%x20,valueonly)%x09%(trailers:key=Differential-Revision,separator=%x20,valueonly)'
producing short commit id, author name, author email, date, commit message, trailer Reviewed-By, and trailer Differential-Revision to your tab-separated values output.
If you may not change the old commit messages because your history is not safe for doing this (it's published, pulled by peers, your tools are bound to the published commit hashes), then you have to process the git log output with sed, awk, perl, or any other text-transforming tool to generate your report. Say, process something like git log --pretty=format:'%x02%h%x1F%an%x1F%ae%x1F%aD%x1F%s%x1F%n%B', where the lines between one ^B (STX) and the next (or EOF) are filtered for the trailers you are interested in, then joined to the group line starting with ^B, and finally the field and entry separators are replaced with \t and nothing respectively.
But again, if you may edit the history by fixing commit message trailers (not sure how much it may affect), I'd recommend you do that and then reject the idea of extra scripts processing trailers that are not recognized by git-interpret-trailers and simply fix the commit messages.
Edit 1 (text tools)
If rewriting the history is not an option, then implementing some scripts may help you out. I'm pretty weak at writing powerful sed/awk/perl scripts, but let me try.
git log --pretty=format:'%x02%h%x1F%an%x1F%ae%x1F%aD%x1F%s%x1F%n%B' \
| gawk -f trailers.awk \
| sed '$!N;s/\n/\x1F/' \
| sed 's/[\x02\x1E]//g' \
| sed 's/\x1F/\x09/g'
How it works:
git generates a log whose fields are delimited with standard C0 control codes (STX, RS and US), assuming there are no such characters in your commit messages; I don't really know whether this is a good place to use them like that, or whether I apply them semantically correctly;
gawk filters the log output trying to parse STX-started groups and extract the trailers, generating "two-rowed" output (each odd line for regular data, each even line for comma-joined trailer values even for missing trailers);
sed joins odd and even lines by pairs (credits go to Karoly Horvath);
sed removes STX and RS;
sed replaces US to TAB.
Here is the trailers.awk (again I'm not an awk guy and have no idea how idiomatic the following script is, but it seems to work):
#!/usr/bin/awk -f
BEGIN {
FIRST = 1
delete TRAILERS
}
function print_joined_array(array) {
if ( !length(array) ) {
return
}
for ( i in array ) {
if ( i > 0 ) {
printf(",")
}
printf("%s", array[i])
}
printf("\x1F")
}
function print_trailers() {
if ( FIRST ) {
FIRST = 0
return
}
print_joined_array(TRAILERS["Reviewed By"])
print_joined_array(TRAILERS["Differential Revision"])
print ""
}
/^\x02/ {
print_trailers()
print $0
delete TRAILERS
}
match($0, /^([-_ A-Za-z0-9]+):\s+(.*)\s*/, M) {
TRAILERS[M[1]][length(TRAILERS[M[1]])] = M[2]
}
END {
print_trailers()
}
A couple of words on how the awk script works:
it assumes that records that do not require processing are starting with STX;
it tries to grep each non-"STX" line for a Key Name: Value pattern and saves the found result to a temporary array TRAILERS (that serves actually as a multimap, like Map<String, List<String>> in Java) for each record;
each record is written as is, but trailers are written either before detecting a new record or at EOF.
Edit 2 (better awk)
Well, I'm really weak at awk, so once I read more about awk internal variables, I figured out the awk script can be reimplemented entirely and produce a ready to use TSV-like output itself without any post-processing with sed or perl. So the shorter and improved version of the script is:
#!/bin/bash
git log --pretty=format:'%x1E%h%x1F%an%x1F%ae%x1F%aD%x1F%s%x1F%B%x1E' \
| gawk -f trailers.awk
#!/usr/bin/awk -f
BEGIN {
RS = "\x1E"
FS = "\x1F"
OFS = "\x09"
}
function extract(array, trailer_key, __buffer) {
for ( i in array ) {
if ( index(array[i], trailer_key) == 1 ) { # the key must start the line
if ( length(__buffer) > 0 ) {
__buffer = __buffer ","
}
__buffer = __buffer substr(array[i], length(trailer_key) + 1) # substr() is 1-based, so the value starts right after the key
}
}
return __buffer
}
NF > 1 {
split($6, array, "\n")
print $1, $2, $3, $4, $5, extract(array, "Reviewed By: "), extract(array, "Differential Revision: ")
}
Much more concise, easier to read, understand and maintain.
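The core of the extraction, looking up a key that contains spaces at the start of a line, can also be tried in isolation without a repository. A small sketch under assumptions: the helper name trailer_value is made up, the message text is copied from the question, and the blank lines in it are assumed since the log excerpt above is flattened.

```shell
msg='[TTI] Add support for target hook in compiler.

This is my body of commit.

Reviewed By: Sushant
Differential Revision: https://codereviews.com/DD8822'

trailer_value() {
    # $1 = key (may contain spaces); the commit message is read from stdin
    awk -v key="$1: " 'index($0, key) == 1 { print substr($0, length(key) + 1) }'
}

reviewer=$(printf '%s\n' "$msg" | trailer_value 'Reviewed By')
revision=$(printf '%s\n' "$msg" | trailer_value 'Differential Revision')
printf '%s\t%s\n' "$reviewer" "$revision"
```

Against a real repository the message could come from something like git log -1 --format=%B piped into trailer_value.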

Cannot replace string from a file with another string shell

I have a file filename with 2 lines:
2018-Feb-22 06:02:01.1234|AVC-00123HHGF|427654|Default|Name1 [1]|2334|2344444|(00:00:00.45567)|
2018-Feb-22 07:02:01.1234|BCV-00123HHGF|427654|Default|Name1 [1]|2334|2344444|(00:00:00.45567)|
I want to concat string
"Warning: Time elapsed:,3444, is smaller than Name2:44222"
At the end of the line which is equal with
Var1="2018-Feb-22 06:02:01.1234|AVC-00123HHGF|427654|Default|Name1 [1]|2334|2344444|(00:00:00.45567)|"
Or has the following pattern
Var2="2018-Feb-22 06:02:01.1234|AVC-00123HHGF|"
And then filename will contain
2018-Feb-22 06:02:01.1234|AVC-00123HHGF|427654|Default|Name1 [1]|2334|2344444|(00:00:00.45567)|"Warning: Time elapsed:,3444, is smaller than Name2:44222"
2018-Feb-22 07:02:01.1234|BCV-00123HHGF|427654|Default|Name1 [1]|2334|2344444|(00:00:00.45567)|
This is what i've tried:
Var3='2018-Feb-22 06:02:01.1234|AVC-00123HHGF|427654|Default|Name1 [1]|2334|2344444|(00:00:00.45567)|"Warning: Time elapsed:,3444, is smaller than Name2:44222"'
sed -i 's/'"$Var1"'/'"$Var3"'/' filename
sed -i "s/$Var1/$Var3/" filename
Var4='"Warning: Time elapsed:,3444, is smaller than Name2:44222"'
sed -i "/$Var1/a $Var4" filename
But nothing happens. Not even an error.
It's there any other way to do this? I need to keep the same order of the lines within filename.
UPDATE: I gave up on using sed and tried a less optimal solution, but it works.
I have 2 files:
File_to_change
File_with_lines_to_add
while read -r line; do
Prkey=##calculate pk
N=0
while read -r linetoadd; do
Prmkey=##calculate pk
if [ "$Prkey" = "$Prmkey" ]; then
N=1
echo "$line$linetoadd">>outfile
fi
done < File_with_lines_to_add
if [ "$N" = "0" ]; then
echo "$line">>outfile
fi
done < File_to_change
suffix="Warning: Time elapsed:,3444, is smaller than Name2:44222"
pattern="AVC-"
sed -E "/$pattern/s/^(.*)$/\1$suffix/" filename
2018-Feb-22 06:02:01.1234|AVC-00123HHGF|427654|Default|Name1 [1]|2334|2344444|(00:00:00.45567)|Warning: Time elapsed:,3444, is smaller than Name2:44222
2018-Feb-22 07:02:01.1234|BCV-00123HHGF|427654|Default|Name1 [1]|2334|2344444|(00:00:00.45567)|
sed -E : -E allows later usage of () for grouping, without masking
"..." : the command. Double quotes allow $x expressions to be evaluated by the shell before sed gets to read them
/$pattern/ : look for this pattern and only act, if pattern is found
s/a/b/ : substitute expression a with b
/^(.*)$/ : our a-expression
^ Start of line
(.*) : an arbitrary character, and in arbitrary count, captured as a group for later reference as \1, since it's the first group.
$ : end of line
/\1$suffix/ : our b-expression
\1 : what matched above the (.*) pattern
$suffix : what was replaced by the shell
filename
Note that many keywords (better: key characters, since most of them are only 1 character long) change their meaning by context, and that quoting matters, as do flags like -E, -i, -r.
For example, the $ can be interpreted by the shell, but if left untouched, it can mean 'end of line', 'last line' or a literal dollar sign.
'+' can mean 'at least one', '.' can mean 'any character', and a \ is used for masking in sed as well as for introducing back references like \1. It's a mess, but very useful to learn.
Use sed with care.
The vertical bar in "34|AVC-00123HHGF|42" will be interpreted by sed -E as alternation. Alternation has the lowest precedence, so the pattern means "34" or "AVC-00123HHGF" or "42", and it matches any line containing one of:
"34"
"AVC-00123HHGF"
"42"
which is far looser than the literal text "34|AVC-00123HHGF|42". (Without -E the bar is already literal, and in GNU sed it is \| that means alternation.) How to handle that with -E? Well - masking:
"34\|AVC-00123HHGF\|42"
which might again be done by other sed programs, but you guess where that leads to.
"34.AVC-00123HHGF.42" would match, so make reasonable paranoid decisions, and test and control. :)
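Because |, [, ] and . are all special to sed, one way to sidestep the escaping question entirely is to compare plain strings instead of regular expressions. The following sketch uses awk's index() for that; it is an alternative technique, not the sed approach described above. The variable names log, prefix, suffix and appended are illustrative, and values passed with -v must not contain backslashes, since awk would interpret them as escape sequences.

```shell
log=$(mktemp)
printf '%s\n' \
  '2018-Feb-22 06:02:01.1234|AVC-00123HHGF|427654|Default|Name1 [1]|2334|2344444|(00:00:00.45567)|' \
  '2018-Feb-22 07:02:01.1234|BCV-00123HHGF|427654|Default|Name1 [1]|2334|2344444|(00:00:00.45567)|' > "$log"

Var2='2018-Feb-22 06:02:01.1234|AVC-00123HHGF|'
Var4='"Warning: Time elapsed:,3444, is smaller than Name2:44222"'

# index() compares plain strings, so the bar, brackets and dots need no escaping
appended=$(awk -v prefix="$Var2" -v suffix="$Var4" '
index($0, prefix) == 1 { $0 = $0 suffix }   # line starts with the prefix: append
{ print }' "$log")

printf '%s\n' "$appended"
rm -f "$log"
```

The line order is preserved. To change the file itself, the output can be redirected to a temporary file that is then moved over the original.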
Try this (BSD/macOS sed syntax; with GNU sed, drop the '' after -i):
sed -i '' '/2018-Feb-22 06:02:01.1234|AVC-00123HHGF|/s/$/\"Warning: Time elapsed:,3444, is smaller than Name2:44222\"/' filename
If that doesn't work, retreat to something simpler, tell us what happens when you try this:
sed 's/2018/XXXX/' filename

How to display line numbers when comparing files with linux "comm" tool

I would like to diff two very large files (multi-GB), using linux command line tools, and see the line numbers of the differences. The order of the data matters.
I am running on a Linux machine and the standard diff tool gives me the "memory exhausted" error. -H had no effect.
In my application, I only need to stream the diff results. That is, I just want to visually look at the first few differences, I don't need to inspect the entire file. If there are differences, a quick glance will tell me what is wrong.
'comm' seems well suited to this, but it does not display line numbers of the differences.
In general, my multi-GB files only have a few hundred lines that are different, the rest of the file is the same.
Is there a way to get comm to dump the line number? Or a way to make diff run without loading the entire file into memory? (like cutting the input files into 1k blocks, without actually creating a million 1k-files in my filesystem and cluttering everything up)?
I won't use comm, but as you said WHAT you need, in addition to HOW you thought you should do it, I'll focus on the "WHAT you need" instead :
An interesting way would be to use paste and awk: paste can show 2 files "side by side" using a separator. If you use \n as the separator, it displays the 2 files interleaved: line 1 of each, followed by line 2 of each, etc.
So the script you could use could be simply (once you know that there are the same number of lines in each files) :
paste -d '\n' /tmp/file1 /tmp/file2 | awk '
NR%2 { linefirstfile=$0 ; }
!(NR%2) { if ( $0 != linefirstfile )
{ print "line",NR/2,": "; print linefirstfile ; print $0 ; } }'
(Interestingly, this solution can easily be extended to diff N files in a single read, whatever the sizes of the N files are ... just add a check that all have the same number of lines before doing the comparison steps (otherwise "paste" will in the end show only lines from the bigger files))
Here is a (short) example, to show how it works:
$ cat > /tmp/file1
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
$ cat > /tmp/file2
A
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
$ paste -d '\n' /tmp/file1 /tmp/file2
A
A
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
E
E
$ paste -d '\n' /tmp/file1 /tmp/file2 | awk '
NR%2 { linefirstfile=$0 ; }
!(NR%2) { if ( $0 != linefirstfile )
{ print "line",NR/2,": "; print linefirstfile ; print $0 ; } }'
line 2 :
C %FORGOT% fmsdflmdflskdf dfldksdlfkdlfkdlkf
C sdflmsdflmsdfsklmdfksdmfksd fmsdflmdflskdf dfldksdlfkdlfkdlkf
If it happens that the files don't have the same number of lines, then you can first add a check of the line counts, comparing $(wc -l < /tmp/file1) and $(wc -l < /tmp/file2), and only run the paste ... | awk pipeline if they match, to ensure that "paste" always emits one line of each! (But of course, in that case, there will be one (fast!) entire read of each file...)
You can easily adjust it to display exactly as you need it to. And you could quit after the Nth difference (either automatically, with a counter in the awk loop, or by pressing CTRL-C when you saw enough)
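The "quit after the Nth difference" idea could look roughly like this. It is only a sketch: the function name first_diffs, its third argument and the sample files are invented for illustration, and it assumes both files have the same number of lines.

```shell
first_diffs() {
    # $1, $2 = files with the same number of lines; $3 = how many differences to show
    paste -d '\n' "$1" "$2" | awk -v max="$3" '
    NR % 2 { first = $0; next }          # odd rows come from the first file
    $0 != first {                        # even rows come from the second file
        printf "line %d:\n%s\n%s\n", NR / 2, first, $0
        if (++shown >= max) exit         # stop reading once enough was shown
    }'
}

# hypothetical sample data: lines 2 and 4 differ
work=$(mktemp -d)
printf '%s\n' A B C D > "$work/file1"
printf '%s\n' A b C d > "$work/file2"
report=$(first_diffs "$work/file1" "$work/file2" 1)
printf '%s\n' "$report"
rm -rf "$work"
```

With a limit of 1 only the difference at line 2 is reported. On multi-GB inputs the early exit means the rest of the files need not be read.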
Which versions of diff have you tried? GNU diff has a "--speed-large-files" which may help.
The comm tool assumes the lines are sorted.
