Diff command not working logically - linux

Original file contains:
B
RBWBW
RWRWWRBWWWBRBWRWWBWWB
My file contains :
B
RBWBW
RWRWWRBWWWBRBWRWWBWWB
However, when I run diff original myfile, it shows the following:
1,3c1,3
< B
< RBWBW
< RWRWWRBWWWBRBWRWWBWWB
---
> B
> RBWBW
> RWRWWRBWWWBRBWRWWBWWB
When I add the -w flag (diff original myfile -w), it shows no differences, but I'm absolutely sure these two files have no whitespace or line-ending differences. What's the problem?

The visible text in the two files is identical.
Maybe you have extra whitespace.
Try:
diff -w -B file1.txt file2.txt
-w Ignore all white space.
-B Ignore changes whose lines are all blank.

As seen in the comments, the two files must have different line endings, most likely because the original file came from a DOS system. That is why -w, which also ignores the carriage return at the end of each line, made the files match.
To repair the file, run:
dos2unix file
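If dos2unix is not installed, stripping the carriage returns with tr has the same effect for this case. The following is a minimal sketch with made-up demo files in a temporary directory; note that tr -d removes every CR in the file, not only those at the end of a line.

```shell
# Hypothetical demo files: identical text, one with DOS (CRLF) line endings.
tmpdir=$(mktemp -d)
printf 'B\nRBWBW\n' > "$tmpdir/original"
printf 'B\r\nRBWBW\r\n' > "$tmpdir/myfile"

# Delete all carriage returns and write the result to a new file.
tr -d '\r' < "$tmpdir/myfile" > "$tmpdir/myfile.unix"

# diff exits with status 0 only when the files are identical.
if diff "$tmpdir/original" "$tmpdir/myfile.unix" > /dev/null; then
    result="identical"
else
    result="different"
fi
echo "$result"
```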

Look at both files in hex format (for example with hexdump -C or od -c). This way you can really see whether they are the same.
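A byte-level look could go roughly like this. It is a sketch on made-up demo files, not the asker's real data: od -c prints every byte, cmp reports where the first difference is, and grep counts the lines that end in a carriage return.

```shell
# Hypothetical demo files: identical text, one with DOS (CRLF) line endings.
tmpdir=$(mktemp -d)
printf 'B\nRBWBW\n' > "$tmpdir/original"
printf 'B\r\nRBWBW\r\n' > "$tmpdir/myfile"

# od -c dumps each byte; the CRLF file shows \r before every \n.
od -c "$tmpdir/original"
od -c "$tmpdir/myfile"

# cmp reports the position of the first differing byte (non-zero exit if any).
cmp "$tmpdir/original" "$tmpdir/myfile" || true

# Count the lines that end in a carriage return.
cr=$(printf '\r')
crlf_lines=$(grep -c "$cr\$" "$tmpdir/myfile")
echo "lines ending in CR: $crlf_lines"
```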


What's confusing both grep and ack?

Try this: download https://www.mathworks.com/matlabcentral/fileexchange/19-delta-sigma-toolbox
In the unzipped folder, I get the following results:
ack --no-heading --no-break --matlab dsexample
Contents.m:56:% dsexample1 - Discrete-time lowpass/bandpass/quadrature modulator.
Contents.m:57:% dsexample2 - Continuous-time lowpass modulator.
dsexample1(dsm, LiveDemo);
fprintf(1,'Done.\n');
adc.sys_cs = sys_cs;
grep -nH -R --include="*.m" dsexample
Contents.m:56:% dsexample1 - Discrete-time lowpass/bandpass/quadrature modulator.
Contents.m:57:% dsexample2 - Continuous-time lowpass modulator.
dsexample1(dsm, LiveDemo); d center frequency larger Hinfation Script
fprintf(1,'Done.\n');c = c;formed.s of finite op-amp gain and capacitorased;;n for the input.
adc.sys_cs = sys_cs;snr;seed with CT simulations tora states used in the d-t model_amp); Response');
What's going on?
[Edit for clarification]: Why is there no file name and no line number on the 3rd result line? Why do the results on the 4th and 5th lines not even contain dsexample?
NB: using ack 3.40 and grep 2.16
I do not deserve any credits for this answer - It is all about line endings.
I have known for years about Windows line endings (CR-LF) and Linux line endings (LF only), but I had never heard of legacy Mac line endings (CR only). The latter really upset ack, grep, and I'm sure lots of other tools.
dos2unix and unix2dos had no effect on files in legacy Mac format, but after using this nifty little endline tool, I could eventually bring some consistency to the source files:
endlines : 129 files converted from :
- 23 Legacy Mac (CR)
- 105 Unix (LF)
- 1 Windows (CR-LF)
Now, ack and grep are much happier.
Let's see which files contain dsexample (grep -l prints just the file names, not the contents):
$ grep -l dsexample *
Contents.m
demoLPandBP.m
dsexample1.m
dsexample2.m
Ok, then, file shows that they have CR line terminators. (It would say "CRLF line terminators" for Windows files.)
$ file Contents.m demoLPandBP.m dsexample*
Contents.m: ASCII text
demoLPandBP.m: ASCII text, with CR line terminators
dsexample1.m: ASCII text, with CR line terminators
dsexample2.m: ASCII text, with CR line terminators
Unlike what I commented about before, Contents.m is fine. Let's look at another one, how it prints:
$ grep dsexample demoLPandBP.m
dsexample1(dsm, LiveDemo); d center frequency larger Hinf
The output from grep is actually the whole file, since grep doesn't consider the plain CR as breaking a line -- the whole file is just one line. If we change CRs to LFs, we see it better, or can just count the lines:
$ grep dsexample demoLPandBP.m | tr '\r' '\n' | wc -l
51
These are the longest lines there, in order:
%% 5th-order lowpass with optimized zeros and larger Hinf
dsm.f0 = 1/6; % Normalized center frequency
dsexample1(dsm, LiveDemo);
With a CR in the end of each, the cursor moves back to the start of the line, partially overwriting the previous output, so you get:
dsexample1(dsm, LiveDemo); d center frequency larger Hinf
(There's a space after the semicolon on that line, so the e gets overwritten too. I checked.)
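The one-long-line effect can be reproduced on a small made-up file. This is only a sketch: demo.m and its contents are invented, and the counts are specific to this sample.

```shell
tmpdir=$(mktemp -d)
# Three "lines" separated by bare CRs, as in a legacy Mac file (invented content).
printf 'first line\rdsexample1(dsm);\rlast' > "$tmpdir/demo.m"

# grep finds no LF, so the whole file is a single line and is printed in full.
matched_raw=$(grep dsexample "$tmpdir/demo.m" | tr '\r' '\n' | wc -l | tr -d ' ')

# Converting CR to LF before grep returns only the line that really matches.
matched_fixed=$(tr '\r' '\n' < "$tmpdir/demo.m" | grep dsexample | wc -l | tr -d ' ')

echo "lines printed without conversion: $matched_raw"
echo "lines printed with conversion:    $matched_fixed"
```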
Someone said dos2unix can't deal with that, and well, they're not DOS or Windows files anyway so why should it. You could do something like this, though, in Bash:
for f in *.m; do
    if [[ $(file "$f") = *"ASCII text, with CR line terminators" ]]; then
        tr '\r' '\n' < "$f" > tmptmptmp &&
            mv tmptmptmp "$f"
    fi
done
I think it was just the .m files that had the issue, hence the *.m in the loop. There was at least one PDF file there, and we don't want to break that. Though with the check on file there, it should be safe even if you just run the loop on *.
It looks like both ack and grep are getting confused by the line endings in the files. Run file *.m on your files. You'll see that some files have proper linefeeds, and some have CR line terminators.
If you clean up your line endings, things should be OK.
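Where the file utility is unavailable, the line-ending style can be guessed by counting CR and LF bytes. The helper below is a rough sketch: line_ending_kind is a hypothetical name, and treating equal CR and LF counts as CRLF is only a heuristic.

```shell
# Hypothetical helper: guess a file's line-ending style from its byte counts.
line_ending_kind() {
    # tr -cd keeps only the listed byte; wc -c counts what is left.
    cr_count=$(tr -cd '\r' < "$1" | wc -c | tr -d ' ')
    lf_count=$(tr -cd '\n' < "$1" | wc -c | tr -d ' ')
    if [ "$cr_count" -eq 0 ]; then
        echo "LF"        # Unix endings (or no line breaks at all)
    elif [ "$lf_count" -eq 0 ]; then
        echo "CR"        # legacy Mac endings
    elif [ "$cr_count" -eq "$lf_count" ]; then
        echo "CRLF"      # heuristic: every LF is assumed to follow a CR
    else
        echo "mixed"
    fi
}
```

It could be called as line_ending_kind demoLPandBP.m, once per file in a loop.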

Creating a file by merging two files

I would like to merge two files into a new file using a Linux command.
I have two files named a1b.txt and a1c.txt.
Content of a1b.txt
Hi,Hi,Hi
How,are,you
Content of a1c.txt
Hadoop|are|world
Data|Big|God
And I need a new file called merged.txt with the content below (expected output):
Hi,Hi,Hi
How,are,you
Hadoop|are|world
Data|Big|God
To achieve that, I am running the command below in the terminal:
cat /home/cloudera/inputfiles/a1* > merged.txt
But it gives me this output:
Hi,Hi,Hi
How,are,youHadoop|are|world
Data|Big|God
Could somebody help me get the expected output?
Your files probably do not end with a newline character. Here is how to append one to each file that lacks it (this relies on GNU sed):
$ sed -i -e '$a\' /home/cloudera/inputfiles/a1*
$ cat /home/cloudera/inputfiles/a1* > merged.txt
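If the input files must stay untouched, the missing newline can be added while concatenating instead. This is a sketch only: merge_with_newlines is a hypothetical helper, and the demo files are created in a temporary directory rather than under /home/cloudera.

```shell
# Hypothetical helper: concatenate files, adding a newline after any file
# whose last byte is not already a newline.
merge_with_newlines() {
    for f in "$@"; do
        cat "$f"
        # $(...) strips a trailing newline, so a non-empty result means
        # the last byte of the file is something other than a newline.
        if [ -n "$(tail -c 1 "$f")" ]; then
            printf '\n'
        fi
    done
}

# Demo with made-up copies of the two files; the first lacks a final newline.
tmpdir=$(mktemp -d)
printf 'Hi,Hi,Hi\nHow,are,you' > "$tmpdir/a1b.txt"
printf 'Hadoop|are|world\nData|Big|God\n' > "$tmpdir/a1c.txt"
merge_with_newlines "$tmpdir/a1b.txt" "$tmpdir/a1c.txt" > "$tmpdir/merged.txt"
cat "$tmpdir/merged.txt"
```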
If you are allowed to be destructive (that is, you do not have to keep the original two files unmodified), then:
robert#debian:/tmp$ cat fileB.txt >> fileA.txt
robert#debian:/tmp$ cat fileA.txt
this is file A
This is file B.

text file contains lines of bizarre characters - want to fix

I'm an inexperienced programmer grappling with a new problem in a large text file which contains data I am trying to process. Here's a screen capture of what I'm looking at (using 'less' - I am on a linux server):
https://drive.google.com/file/d/0B4VAqfRxlxGpaW53THBNeGh5N2c/view?usp=sharing
Bioinformaticians will recognize this file as a "fastq" file containing DNA sequence data. The top half of the screenshot contains data in its expected format (which I admit contains some "bizarre" characters, but that is not the issue). However, the bottom half (with many characters shaded in white) is completely messed up. If I were to scroll down the file, it eventually returns to normal text after about 500 lines. I want to fix it because it is breaking downstream operations I am trying to perform (which complain about precisely this position in the file).
Is there a way to grep for and remove the shaded lines? Or can I fix this problem by somehow changing the encoding on the offending lines?
Thanks
If you are lucky, you can use
strings file > file2
Oh well, try it another way.
Determine the line length of the correct lines (I think the first two lines are different).
head -1 file | wc -c
head -2 file | tail -1 | wc -c
Hmm, wc also counts the line ending, so subtract 1 from both lengths.
Then try to read the file one line at a time. Use a case statement so you do not have to write a lot of else-if constructions for comparing the length to the expected length. In the code I will accept the lengths 20, 100 and 330.
Redirect everything to another file outside the loop (a redirect inside the loop would overwrite the file on every line).
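# IFS= and -r keep each line exactly as it is in the file
while IFS= read -r line; do
    case ${#line} in
        # keep only lines whose length matches an expected value
        20|100|330) printf '%s\n' "$line" ;;
    esac
done < file > file2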
A totally different approach would be to filter out the wrong lines with sed, awk or grep, but that would require knowing which characters you will and won't accept.
Yes, if you are lucky, all ugly lines will have a character in common, like '<' or maybe '#'. In that case you can use egrep:
egrep -v "<|#" file > file2
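Another filter along the same lines keeps only lines made entirely of printable ASCII. The sketch below assumes that valid lines in the file never contain control characters or tabs; sample.fastq and its contents are invented. Dropping single lines can also break the four-line record structure of a FASTQ file, so the result still needs checking.

```shell
tmpdir=$(mktemp -d)
# Made-up sample: two clean lines and one line containing an ESC control byte.
printf '@read1\nACGT\n\033[7mgarbage\n' > "$tmpdir/sample.fastq"

# In the C locale, [[:print:]] matches printable ASCII only, so any line
# holding a control byte or a non-ASCII byte fails the whole-line match.
LC_ALL=C grep -E '^[[:print:]]*$' "$tmpdir/sample.fastq" > "$tmpdir/clean.fastq"

kept=$(wc -l < "$tmpdir/clean.fastq" | tr -d ' ')
echo "kept $kept lines"
```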
Based on inspection of the screenshot:
sed -r 's/<[[:alnum:]]{2}>//g;s/\^.//g;s/ESC\^*C*//g' file
To make the actual changes in the file and keep a backup copy with a .bak extension, run:
sed -r -i.bak 's/<[[:alnum:]]{2}>//g;s/\^.//g;s/ESC\^*C*//g' file

What does 1c1 in diff tool mean?

I ran diff with two files and got the following output:
1c1
< dbacaad
---
> dbacaad
What does this mean? My two files seem to be exactly the same.
Thank you very much!
To answer the question you raised in the title: 1c1 indicates that line 1 in the first file was changed (c stands for "change") somehow to produce line 1 in the second file.
In practical terms: They probably differ in whitespace (perhaps trailing spaces, or Unix versus Windows line endings?).
Try diff -w file1 file2, which will ignore whitespace. Or cmp file1 file2, which
will tell you how many bytes into the file the first difference occurs.
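A small reproduction, assuming the cause is a trailing space (one possibility among several), might look like this. The file names are made up; sed -n l prints each line unambiguously and marks its end with $, so trailing blanks and carriage returns become visible.

```shell
tmpdir=$(mktemp -d)
printf 'dbacaad\n'  > "$tmpdir/file1"
printf 'dbacaad \n' > "$tmpdir/file2"   # trailing space: the assumed cause

# The first line of diff's output is the change command.
first_line=$(diff "$tmpdir/file1" "$tmpdir/file2" | head -n 1) || true
echo "$first_line"

# Show the lines with their ends marked, so invisible characters stand out.
sed -n l "$tmpdir/file1"
sed -n l "$tmpdir/file2"
```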

Diff-ing files with Linux command

What Linux command allows me to check whether all the lines in file A exist in file B? (It's almost like a diff, but not quite.) File A has unique lines, and so does file B.
The comm command compares two sorted files, line by line, and is part of GNU coreutils.
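For this question, comm could be used roughly as follows. This is a sketch on made-up files A and B; both inputs are sorted first because comm expects sorted input.

```shell
tmpdir=$(mktemp -d)
printf 'a\nc\nb\n' > "$tmpdir/A"
printf 'b\na\nd\n' > "$tmpdir/B"

# comm needs both inputs sorted.
sort "$tmpdir/A" > "$tmpdir/A.sorted"
sort "$tmpdir/B" > "$tmpdir/B.sorted"

# -2 hides lines found only in B, -3 hides lines common to both,
# so what remains are the lines found only in A.
missing=$(comm -23 "$tmpdir/A.sorted" "$tmpdir/B.sorted")

if [ -z "$missing" ]; then
    echo "Every line of A exists in B."
else
    echo "Lines of A missing from B:"
    echo "$missing"
fi
```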
Are you looking for a better diff tool?
https://stackoverflow.com/questions/12625/best-diff-tool
So, what if A has
a
a
b
and B has
a
b
What would you want the output to be (yes or no)?
Use the diff command.
if cat A A B | sort | uniq -c | egrep -e '^[[:space:]]*2[[:space:]]' > /dev/null; then
    echo "A has lines that are not in B."
fi
If you do not redirect the output, you will get a list of all the lines that are in A but not in B (except that each line will have a 2 in front of it). This relies on the lines in A being unique, and the lines in B being unique.
If they aren't, and you don't care about counting duplicates, it's relatively simple to transform each file into a list of unique lines using sort and uniq.
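An alternative that needs neither sorting nor unique lines is to use the lines of B as fixed-string patterns for grep. This is a sketch with made-up files, and it assumes B is not empty.

```shell
tmpdir=$(mktemp -d)
printf 'a\nc\nb\n' > "$tmpdir/A"
printf 'b\na\nd\n' > "$tmpdir/B"

# -F: patterns are fixed strings, -x: match whole lines only,
# -v: print non-matching lines, -f: read the patterns from file B.
# grep exits non-zero when it prints nothing, hence the || true.
only_in_a=$(grep -Fxv -f "$tmpdir/B" "$tmpdir/A" || true)

if [ -z "$only_in_a" ]; then
    echo "All lines of A exist in B."
else
    echo "A has lines that are not in B:"
    echo "$only_in_a"
fi
```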
