How do I do a one way diff in Linux? - linux

How do I do a one way diff in Linux?
Normal behavior of diff:
Normally, diff tells you all the differences between two files: it reports everything that is in file A but not in file B, and also everything that is in file B but not in file A. For example:
File A contains:
cat
good dog
one
two
File B contains:
cat
some garbage
one
a whole bunch of garbage
something I don't want to know
If I do a regular diff as follows:
diff A B
the output would be something like:
2c2
< good dog
---
> some garbage
4c4,5
< two
---
> a whole bunch of garbage
> something I don't want to know
What I am looking for:
What I want is just the first part: I want to know everything that is in file A but not in file B, and to ignore everything that is in file B but not in file A.
What I want is the command, or series of commands:
???? A B
that produces the output:
2c2
< good dog
4c4,5
< two
I believe a solution could be achieved by piping the output of diff into sed or awk, but I am not familiar enough with those tools to come up with a solution. I basically want to remove all lines that begin with --- and >.
Edit: I edited the example to account for multiple words on a line.
Note: This is a "sub-question" of: Determine list of non-OS packages installed on a RedHat Linux machine
Note: This is similar to, but not the same as the question asked here (e.g. not a dupe):
One-way diff file

An alternative, if your files consist of single-line entities only, and the output order doesn't matter (the question as worded is unclear on this), would be:
comm -23 <(sort A) <(sort B)
comm requires its inputs to be sorted, and the -2 means "don't show me the lines that are unique to the second file", while -3 means "don't show me the lines that are common between the two files".
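If your shell has no process substitution, the same comparison can be sketched with temporary files. This is only an illustration on the sample files from the question: the scratch directory and the LC_ALL=C setting (used so that sort and comm agree on the collation order) are my assumptions, not part of the command above.

```shell
# Scratch directory with the sample files from the question (illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'cat' 'good dog' 'one' 'two' > "$tmp/A"
printf '%s\n' 'cat' 'some garbage' 'one' 'a whole bunch of garbage' \
    "something I don't want to know" > "$tmp/B"

# comm needs sorted input; sort both files with the same collation.
LC_ALL=C sort "$tmp/A" > "$tmp/A.sorted"
LC_ALL=C sort "$tmp/B" > "$tmp/B.sorted"

# -2 hides lines unique to B, -3 hides common lines: only A-only lines remain.
only_in_a=$(LC_ALL=C comm -23 "$tmp/A.sorted" "$tmp/B.sorted")
printf '%s\n' "$only_in_a"

rm -rf "$tmp"
```

The result is in sorted order, not in the order the lines appear in A.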
If you need the "differences" to be presented in the order they occur, though, the above diff / awk solution is ok (although the grep bit isn't really necessary - it could be diff A B | awk '/^</ { $1 = ""; print }').
EDIT: fixed which set of lines to report - I read it backwards originally...

As stated in the comments, one mostly correct answer is
diff A B | grep '^<'
although this would give the output
< good dog
< two
rather than
2c2
< good dog
4c4,5
< two
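To keep the change headers as well, as the question asks, one option is to invert the filter and drop only the lines that start with > and the --- separators. A minimal sketch, assuming diff's default (normal) output format and a throwaway scratch directory; note that hunks consisting purely of additions (such as 3a4) would still leave their header line behind.

```shell
# Scratch directory with the sample files from the question (illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'cat' 'good dog' 'one' 'two' > "$tmp/A"
printf '%s\n' 'cat' 'some garbage' 'one' 'a whole bunch of garbage' \
    "something I don't want to know" > "$tmp/B"

# Drop the "> ..." lines and the "---" separators; keep headers and "< ..." lines.
one_way=$(diff "$tmp/A" "$tmp/B" | grep -v -e '^>' -e '^---$')
printf '%s\n' "$one_way"

rm -rf "$tmp"
```

Lines taken from A always carry the "< " prefix, so a line of A that itself consists of --- is not removed by the second pattern.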

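diff A B|grep '^<'|awk '{print $2}'
grep '^<' selects the rows that start with <.
awk '{print $2}' prints the second column, i.e. only the first word of each line from A. For lines with several words, such as good dog, strip the marker with sed 's/^< //' instead of the awk step.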

If you want to also see the files in question, in case of diffing folders, you can use
diff public_html temp_public_html/ | grep '^[^>]'
to match all but lines starting with >

Related

Comparing two different files

say I have two data files that could look like this.
A dog 3
A cat 1
A mouse 4
A chicken 4
and
B tiger 2
B chicken 1
B dog 3
B wolf 2
How would I be able to look at only the animals that are common in both files? Ideally, I would like the output to look something like
dog 3 3
chicken 4 1
But even outputting just the ones along with its value that are common in both files is good enough for me. Thanks.
this one-liner should do:
awk 'NR==FNR{a[$2]=$2 FS $3;next}a[$2]{print a[$2],$3}' f1 f2
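For readers who want to see what the one-liner is doing, here is a commented sketch of the same idea run on the sample data. The file names f1 and f2 come from the answer; the scratch directory and the array name seen are assumptions for illustration, and the sketch tests membership with `in` rather than relying on the truthiness of a[$2].

```shell
# Scratch directory with the two sample files from the question (illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'A dog 3' 'A cat 1' 'A mouse 4' 'A chicken 4' > "$tmp/f1"
printf '%s\n' 'B tiger 2' 'B chicken 1' 'B dog 3' 'B wolf 2' > "$tmp/f2"

common=$(awk '
    # First file (NR == FNR): remember "animal value", keyed by animal name.
    NR == FNR { seen[$2] = $2 FS $3; next }
    # Second file: on a match, print the stored pair plus the value from this file.
    ($2 in seen) { print seen[$2], $3 }
' "$tmp/f1" "$tmp/f2")
printf '%s\n' "$common"

rm -rf "$tmp"
```

Matches come out in the order they appear in the second file, so chicken is printed before dog here.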
@Kent has done some serious one line magic. Anyway, I did a shell script you could try. Simply run ./script [file1] [file2]
#!/bin/bash
# Read input: the middle column holds the words, the last column the values
words1=$(cat "$1" | sed -r "s/.*\ (.*)\ .*/\1/")
val1=$(cat "$1" | sed -r "s/.*\ .*\ (.*)/\1/")
words2=$(cat "$2" | sed -r "s/.*\ (.*)\ .*/\1/")
val2=$(cat "$2" | sed -r "s/.*\ .*\ (.*)/\1/")
# Convert to array (relies on word splitting, so entries must not contain spaces)
words1=($words1)
val1=($val1)
words2=($words2)
val2=($val2)
# Iterate and print result
for i in "${!words1[@]}"; do
    for j in "${!words2[@]}"; do
        if [ "${words1[i]}" = "${words2[j]}" ]; then
            echo "${words1[i]} ${val1[i]} ${val2[j]}"
            break
        fi
    done
done
exit 0
I'm not sure why this is a linux/unix question. It looks like what you need is a simple program that you'll need to write, as this isn't a basic compare-two-files issue that would be generally covered by applications like Beyond Compare.
Let's assume these files are basic text files that contain one record per line with space-delimited values. (Using a space as the delimiter is fragile, but that's what you have above.) You'll need to read in each file, storing both files as [iterable collection], and have each object either be a string that you act on in each run of a loop, or that you break into pieces as you build from the file. You'll need to compare [linepart 1] from the first file to each [linepart 1] in the second file, and whenever you find a match, break and output [linepart 1] [A.linepart 2] [B.linepart 2].
I can't think of any existing program that would do this for you, but it's fairly simple (assuming you think file IO is simple) to handle with Java, C#, etc.
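As it happens, the standard join utility covers this case once both files are sorted on the animal column. A hedged sketch, assuming single-space-separated fields as in the question; the scratch directory, the file names and the LC_ALL=C setting are illustrative choices so that sort and join use the same ordering.

```shell
# Scratch directory with the two sample files from the question (illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'A dog 3' 'A cat 1' 'A mouse 4' 'A chicken 4' > "$tmp/f1"
printf '%s\n' 'B tiger 2' 'B chicken 1' 'B dog 3' 'B wolf 2' > "$tmp/f2"

# join needs both inputs sorted on the join field (column 2 here).
LC_ALL=C sort -k2,2 "$tmp/f1" > "$tmp/f1.sorted"
LC_ALL=C sort -k2,2 "$tmp/f2" > "$tmp/f2.sorted"

# Join on field 2 of both files; -o selects the join field,
# then field 3 of the first file and field 3 of the second.
joined=$(LC_ALL=C join -1 2 -2 2 -o 0,1.3,2.3 "$tmp/f1.sorted" "$tmp/f2.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

Unlike a nested loop, this reads each sorted file once, at the cost of sorting first.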

Print previous line if condition is met

I would like to grep a word and then find the second column in the line and check if it is bigger than a value. If yes, I want to print the previous line.
Ex:
Input file
AAAAAAAAAAAAA
BB 2
CCCCCCCCCCCCC
BB 0.1
Output
AAAAAAAAAAAAA
Now, I want to search for BB and if the second column (2 or 0.1) in that line is bigger than 1, I want to print the previous line.
Can somebody help me with grep and awk? Thanks. Any other suggestions are also welcome. Thanks.
This can be a way:
$ awk '$1=="BB" && $2>1 {print f} {f=$0}' file
AAAAAAAAAAAAA
Explanation
$1=="BB" && $2>1 {print f}: if the 1st field is exactly BB and the 2nd field is bigger than 1, print f, a stored value.
{f=$0}: store the current line in f, so that it is accessible when reading the next line.
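A slightly more general variant keeps the whole previous line and takes the keyword and threshold as awk variables. This is just a sketch: the variable names key, min and prev and the scratch directory are mine, chosen for illustration.

```shell
# Scratch directory with the sample input from the question (illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'AAAAAAAAAAAAA' 'BB 2' 'CCCCCCCCCCCCC' 'BB 0.1' > "$tmp/file"

result=$(awk -v key="BB" -v min=1 '
    # On a matching line whose 2nd field exceeds min, print the remembered line.
    $1 == key && $2 + 0 > min + 0 { print prev }
    # Remember the whole current line for the next input line.
    { prev = $0 }
' "$tmp/file")
printf '%s\n' "$result"

rm -rf "$tmp"
```

Adding 0 to both sides forces a numeric comparison even if a field looks like text.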
Another option: reverse the file and print the next line if the condition matches:
tac file | awk '$1 == "BB" && $2 > 1 {getline; print}' | tac
Concerning generality
I think it needs to be mentioned that the most general solution to this class of problem involves two passes:
the first pass to add a decimal row number ($REC) to the front of each line, effectively grouping lines into records by $REC
the second pass to trigger on the first instance of each new value of $REC as a record boundary (resetting $CURREC), thereafter rolling along in the native AWK idiom concerning the records to follow matching $CURREC.
In the intermediate file, some sequence of decimal digits followed by a separator (for human reasons, typically an added tab or space) is parsed (aka conceptually snipped off) as out-of-band with respect to the baseline file.
Command line paste monster
Even confined to the command line, it's an easy matter to ensure that the intermediate file never hits disk. You just need to use an advanced shell such as ZSH (my own favourite) which supports process substitution:
paste <( <input.txt awk "BEGIN { R=0; N=0; } /Header pattern/ { N=1; } { R=R+N; N=0; print R; }" ) input.txt | awk -f yourscript.awk
Let's render that one-liner more suitable for exposition:
P="/Header pattern/"
X="BEGIN { R=0; N=0; } $P { N=1; } { R=R+N; N=0; print R; }"
paste <( <input.txt awk "$X" ) input.txt | awk -f yourscript.awk
This starts three processes: the trivial inline AWK script, paste, and the AWK script you really wanted to run in the first place.
Behind the scenes, the <() command line construct creates a named pipe and passes the pipe name to paste as the name of its first input file. For paste's second input file, we give it the name of our original input file (this file is thus read sequentially, in parallel, by two different processes, which will consume between them at most one read from disk, if the input file is cold).
The magic named pipe in the middle is an in-memory FIFO that ancient Unix probably managed at about 16 kB of average size (intermittently pausing the paste process if the yourscript.awk process is sluggish in draining this FIFO back down).
Perhaps modern Unix throws a bigger buffer in there because it can, but it's certainly not a scarce resource you should be concerned about, until you write your first truly advanced command line with process redirection involving these by the hundreds or thousands :-)
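In a shell without process substitution, a similar effect can be had by letting paste read the record numbers from standard input (the - operand). A small sketch on made-up data: the # header pattern, the file contents and the second awk script are all placeholders for illustration, standing in for /Header pattern/ and yourscript.awk.

```shell
# Scratch directory with a tiny two-record input file (illustrative).
tmp=$(mktemp -d)
printf '%s\n' '# first' 'a' 'b' '# second' 'c' > "$tmp/input.txt"

# Pass 1: emit one record number per input line, bumping it on each header.
# paste then glues that number (tab-separated) onto the original line.
# Pass 2: react to the first line of each new record number.
starts=$(awk '/^#/ { R++ } { print R + 0 }' "$tmp/input.txt" |
    paste - "$tmp/input.txt" |
    awk -F '\t' 'NR == 1 || $1 != cur { cur = $1; print "record " cur ": " $2 }')
printf '%s\n' "$starts"

rm -rf "$tmp"
```

As in the one-liner, the intermediate numbered stream only ever lives in pipes, never on disk.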
Additional performance considerations
On modern CPUs, all three of these processes could easily find themselves running on separate cores.
The first two of these processes border on the truly trivial: an AWK script with a single pattern match and some minor bookkeeping, paste called with two arguments. yourscript.awk will be hard pressed to run faster than these.
What, your development machine has no lightly loaded cores to render this master shell-master solution pattern almost free in the execution domain?
Ring, ring.
Hello?
Hey, it's for you. 2018 just called, and wants its problem back.
2020 is officially the reprieve of MTV: That's the way we like it, magic pipes for nothing and cores for free. Not to name out loud any particular TLA chip vendor who is rocking the space these days.
As a final performance consideration, if you don't want the overhead of parsing actual record numbers:
X="BEGIN { N=0; } $P { N=1; } { print N; N=0; }"
Now your in-FIFO intermediate file is annotated with just an additional two characters prepended to each line ('0' or '1' and the default separator character added by paste), with '1' demarking first line in record.
Named FIFOs
Under the hood, these are no different than the magic FIFOs instantiated by Unix when you write any normal pipe command:
cat file | proc1 | proc2 | proc3
Three unnamed pipes (and a whole process devoted to cat you didn't even need).
It's almost unfortunate that the truly exceptional convenience of the default stdin/stdout streams as premanaged by the shell obscures the reality that paste $magictemppipe1 $magictemppipe2 bears no additional performance considerations worth thinking about, in 99% of all cases.
"Use the <() Y-joint, Luke."
Your instinctive reflex toward natural semantic decomposition in the problem domain will herewith benefit immensely.
If anyone had had the wits to name the shell construct <() as the YODA operator in the first place, I suspect it would have been pressed into universal service at least a solid decade ago.
Combining sed & awk you get this:
sed 'N;s/\n/ /' < file |awk '$3>1{print $1}'
sed 'N;s/\n/ /': combine each pair of lines, replacing the newline between them with a space
awk '$3>1{print $1}': print $1 (the 1st column) if $3 (the 3rd column) is greater than 1

How to determine if the content of one file is included in the content of another file

First, my apologies for what is perhaps a rather stupid question that doesn't quite belong here.
Here's my problem: I have two large text files containing a lot of file names, let's call them A and B, and I want to determine if A is a subset of B, disregarding order, i.e. for each file name in A, find if file name is also in B, otherwise A is not a subset.
I know how to preprocess the files (to remove anything but the file name itself, removing different capitalization), but now I'm left to wonder if there is a simple way to perform the task with a shell command.
Diff probably doesn't work, right? Even if I 'sort' the two files first, so that at least the files that are present in both will be in the same order, since A is probably a proper subset of B, diff will just tell me that every line is different.
Again, my apologies if the question doesn't belong here, and in the end, if there is no easy way to do it I will just write a small program to do the job, but since I'm trying to get a better handle on the shell commands, I thought I'd ask here first.
Do this:
cat b | sort -u | wc
cat a b | sort -u | wc
If you get the same result, a is a subset of b.
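The comparison can be wrapped in a small function so the shell does the checking for you. A minimal sketch, assuming one file name per line; the function name is_subset and the sample files are made up for illustration.

```shell
# is_subset A B: succeed if every distinct line of A also occurs in B.
is_subset() {
    count_b=$(sort -u "$2" | wc -l)
    count_ab=$(sort -u "$1" "$2" | wc -l)
    # Adding A to B must not introduce any new distinct line.
    [ "$count_b" -eq "$count_ab" ]
}

# Scratch directory with made-up file lists (illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'x.txt' 'y.txt' > "$tmp/a"
printf '%s\n' 'z.txt' 'y.txt' 'x.txt' > "$tmp/b"
printf '%s\n' 'x.txt' 'w.txt' > "$tmp/c"

if is_subset "$tmp/a" "$tmp/b"; then a_in_b=yes; else a_in_b=no; fi
if is_subset "$tmp/c" "$tmp/b"; then c_in_b=yes; else c_in_b=no; fi
echo "a subset of b: $a_in_b; c subset of b: $c_in_b"

rm -rf "$tmp"
```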
Here's how to do it in awk
awk '
# read A, the supposed subset file
FNR == NR {a[$0]; next}
# process file B
$0 in a {delete a[$0]}
END {if (length(a) == 0) {print "A is a subset of B"}}
' A B
Test if an XSD file is a subset of a WSDL file:
xmllint --format file.wsdl | awk '{$1=$1};1' | sort -u | wc
xmllint --format file.wsdl file.xsd | awk '{$1=$1};1' | sort -u | wc
This adapts the elegant concept of RichieHindle's prior answer using:
xmllint --format instead of cat, to pretty print the XML so each XML element was on one line, as required by sort -u | wc. Other pretty printing commands might work here, e.g. jq . for json.
an awk command to normalise the whitespace: strip leading and trailing (because the indentation is different in both files), and collapse internal. Caveat: does not consider XML attribute order within the element.

egrep not writing to a file

I am using the following command in order to extract domain names & the full domain extension from a file. Ex: www.abc.yahoo.com, www.efg.yahoo.com.us.
[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sort | uniq | sed -e 's/www.//'
> dest.txt
The command writes correctly when I specify a small maximum with the parameter -m 100 after source.txt. The problem occurs if I don't specify it, or if I specify a huge number. However, I have written to files with grep (not egrep) before, with huge numbers similar to what I'm trying now, and that was successful. I also checked the last-modified date and time while the command was running, and the destination file does not seem to be modified at all. What could be the problem?
As I mentioned in your earlier question, it's probably not an issue with egrep, but that your file is too big and that sort won't output anything (to uniq) until egrep is done. I suggested that you split the file into manageable chunks using the split command. Something like this:
split -l 10000000 source.txt split_source.
This will split the source.txt file into 10 million line chunks called split_source.aa, split_source.ab, split_source.ac etc. You can then run the entire command on each one of those files (changing the redirect at the end to append: >> dest.txt).
The problem here is that you can get duplicates across multiple files, so at the end you may need to run
sort dest.txt | uniq > dest_uniq.txt
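Put together, the chunked approach might look like the sketch below. It is only an outline under assumptions: extract_names is a hypothetical stand-in for the real egrep pipeline from the question (here it merely sorts and deduplicates a chunk), and the tiny chunk size and file contents are made up so the example runs quickly.

```shell
# Scratch directory with a tiny made-up source file (illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'b.com' 'a.com' 'b.com' 'c.com' 'a.com' > "$tmp/source.txt"

# Hypothetical placeholder for the real extraction command.
extract_names() { sort -u "$1"; }

# Split into 2-line chunks: split_source.aa, split_source.ab, ...
split -l 2 "$tmp/source.txt" "$tmp/split_source."

# Process each chunk separately, appending to one destination file.
: > "$tmp/dest.txt"
for chunk in "$tmp"/split_source.*; do
    extract_names "$chunk" >> "$tmp/dest.txt"
done

# Duplicates can span chunks, so deduplicate once more at the end.
sort -u "$tmp/dest.txt" > "$tmp/dest_uniq.txt"
unique=$(cat "$tmp/dest_uniq.txt")
chunks=$(ls "$tmp" | grep -c '^split_source\.')
printf '%s\n' "$unique"

rm -rf "$tmp"
```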
Your question is missing information.
That aside, a few thoughts. First, to debug and isolate your problem:
Run egrep <params> | less so you can see what egrep is doing, and eliminate any problem from sort, uniq, or sed (my bet's on sort).
How big is your input? Any chance sort is dying from too much input?
Gonna need to see the full command to make further comments.
Second, to improve your script:
You may want to sort | uniq AFTER sed, otherwise you could end up with duplicates in your result set, AND an unsorted result set. Maybe that's what you want.
Consider wrapping your regular expressions with "^...$", if it's appropriate to establish beginning of line (^) and end of line ($) anchors. Otherwise you'll be matching portions in the middle of a line.

sort across multiple files in linux

I have multiple (many) files; each very large:
file0.txt
file1.txt
file2.txt
I do not want to join them into a single file because the resulting file would be 10+ Gigs. Each line in each file contains a 40-byte string. The strings are fairly well ordered right now, (about 1:10 steps is a decrease in value instead of an increase).
I would like the lines ordered. (in-place if possible?) This means some of the lines from the end of file0.txt will be moved to the beginning of file1.txt and vice versa.
I am working on Linux and fairly new to it. I know about the sort command for a single file, but am wondering if there is a way to sort across multiple files. Or maybe there is a way to make a pseudo-file made from smaller files that linux will treat as a single file.
What I know I can do:
I can sort each file individually and read into file1.txt to find the value larger than the largest in file0.txt (and similarly grab the lines from the end of file0.txt), join and then sort.. but this is a pain and assumes no values from file2.txt belong in file0.txt (however highly unlikely in my case)
Edit
To be clear, if the files look like this:
f0.txt
DDD
XXX
AAA
f1.txt
BBB
FFF
CCC
f2.txt
EEE
YYY
ZZZ
I want this:
f0.txt
AAA
BBB
CCC
f1.txt
DDD
EEE
FFF
f2.txt
XXX
YYY
ZZZ
I don't know about a command doing in-place sorting, but I think a faster "merge sort" is possible:
for file in *.txt; do
    sort -o "$file" "$file"
done
sort -m *.txt | split -d -l 1000000 - output
The sort in the for loop makes sure the content of the input files is sorted. If you don't want to overwrite the original, simply change the value after the -o parameter. (If you expect the files to be sorted already, you could change the sort statement to "check-only": sort -c $file || exit 1)
The second sort does efficient merging of the input files, all while keeping the output sorted.
This is piped to the split command which will then write to suffixed output files. Notice the - character; this tells split to read from standard input (i.e. the pipe) instead of a file.
Also, here's a short summary of how the merge sort works:
sort reads a line from each file.
It orders these lines and selects the one which should come first. This line gets sent to the output, and a new line is read from the file which contained this line.
Repeat step 2 until there are no more lines in any file.
At this point, the output should be a perfectly sorted file.
Profit!
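Applied to the small f0/f1/f2 example from the question, the whole procedure can be sketched as follows. The scratch directory, the out_ prefix and the LC_ALL=C setting are assumptions for illustration; the sketch also omits -d, so split falls back to its default alphabetic suffixes.

```shell
# Scratch directory with the three sample files from the question (illustrative).
tmp=$(mktemp -d)
printf '%s\n' 'DDD' 'XXX' 'AAA' > "$tmp/f0.txt"
printf '%s\n' 'BBB' 'FFF' 'CCC' > "$tmp/f1.txt"
printf '%s\n' 'EEE' 'YYY' 'ZZZ' > "$tmp/f2.txt"

# Step 1: sort every file in place.
for f in "$tmp"/f?.txt; do
    LC_ALL=C sort -o "$f" "$f"
done

# Step 2: merge the sorted files and cut the stream into 3-line pieces.
# Without -d, split names the pieces with the suffixes aa, ab, ac, ...
LC_ALL=C sort -m "$tmp"/f?.txt | split -l 3 - "$tmp/out_"

first=$(cat "$tmp/out_aa")
second=$(cat "$tmp/out_ab")
third=$(cat "$tmp/out_ac")
printf '%s\n' "$first" "$second" "$third"

rm -rf "$tmp"
```

The merge step only ever holds one line per input file in memory, which is why it scales to files far larger than RAM.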
It isn't exactly what you asked for, but the sort(1) utility can help, a little, using the --merge option. Sort each file individually, then sort the resulting pile of files:
for f in file*.txt ; do sort -o "$f" "$f" ; done
sort --merge file*.txt | split -l 100000 - sorted_file
(That's 100,000 lines per output file. Perhaps that's still way too small.)
I believe that this is your best bet, using stock linux utilities:
sort each file individually, e.g. for f in file*.txt; do sort "$f" > "sorted_$f"; done
sort everything using sort -m sorted_file*.txt | split -d -l <lines> - <prefix>, where <lines> is the number of lines per file, and <prefix> is the filename prefix. (The -d tells split to use numeric suffixes).
The -m option to sort lets it know the input files are already sorted, so it can be smart.
mmap() the 3 files; as all lines are 40 bytes long, you can easily sort them in place (SIP :-). Don't forget the msync at the end.
If the files are sorted individually, then you can use sort -m file*.txt to merge them together - read the first line of each file, output the smallest one, and repeat.
