I ran diff with two files and got the following output:
1c1
< dbacaad
---
> dbacaad
What does this mean? My two files seem to be exactly the same.
Thank you very much!
To answer the question you raised in the title: 1c1 indicates that line 1 in the
first file was changed somehow to produce line 1 in the second file.
In practical terms: They probably differ in whitespace (perhaps trailing spaces, or Unix versus Windows line endings?).
Try diff -w file1 file2, which will ignore whitespace. Or cmp file1 file2, which
will tell you how many bytes into the file the first difference occurs.
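A minimal sketch of how those checks play out, assuming diff, cmp and od are available. The file names a.txt and b.txt and the trailing space are made up for illustration:

```shell
# Two hypothetical files that look identical but differ by one trailing space.
dir=$(mktemp -d)
printf 'dbacaad\n'  > "$dir/a.txt"
printf 'dbacaad \n' > "$dir/b.txt"

# Plain diff reports 1c1 (exit status 1 means "files differ").
diff "$dir/a.txt" "$dir/b.txt" || true

# With -w the whitespace is ignored and diff prints nothing.
diff -w "$dir/a.txt" "$dir/b.txt" && echo "same apart from whitespace"

# cmp names the first differing byte; od -c makes every byte visible.
cmp "$dir/a.txt" "$dir/b.txt" || true
od -c "$dir/b.txt"
```

In the od output the extra space shows up as an empty column just before \n.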
I'm an inexperienced programmer grappling with a new problem in a large text file which contains data I am trying to process. Here's a screen capture of what I'm looking at (using 'less' - I am on a linux server):
https://drive.google.com/file/d/0B4VAqfRxlxGpaW53THBNeGh5N2c/view?usp=sharing
Bioinformaticians will recognize this file as a "fastq" file containing DNA sequence data. The top half of the screenshot contains data in its expected format (which I admit contains some "bizarre" characters, but that is not the issue). However, the bottom half (with many characters shaded in white) is completely messed up. If I were to scroll down the file, it eventually returns to normal text after about 500 lines. I want to fix it because it is breaking downstream operations I am trying to perform (which complain about precisely this position in the file).
Is there a way to grep for and remove the shaded lines? Or can I fix this problem by somehow changing the encoding on the offending lines?
Thanks
If you are lucky, you can use
strings file > file2
Oh well, try it another way.
Determine the line length of the correct lines (I think the first two lines are different).
head -1 file | wc -c
head -2 file | tail -1 | wc -c
Hmm, wc also counts the line ending, so subtract 1 from both lengths.
Then try to read the file one line at a time. Use a case statement so you do not have to write a lot of else-if constructions for comparing the length to the expected length. In the code I will accept the lengths 20, 100 and 330.
Redirect everything to another file outside the loop (redirecting with > inside the loop would overwrite the file on each line).
# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r line; do
  case ${#line} in
    20|100|330) printf '%s\n' "$line" ;;
  esac
done < file > file2
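The same length filter can be sketched as a single awk command. This is an alternative to the loop, not a fix for the underlying corruption. The sample lines and the accepted lengths (4 and 8 here instead of 20, 100 and 330) are invented to keep the example short:

```shell
dir=$(mktemp -d)
# Hypothetical input: two good lines (lengths 4 and 8) and one bad line (length 6).
printf 'ACGT\nACGTACGT\nbroken\n' > "$dir/file"

# List how often each line length occurs, to decide which lengths to accept.
awk '{ print length($0) }' "$dir/file" | sort -n | uniq -c

# Keep only the lines whose length is on the accepted list.
awk 'length($0) == 4 || length($0) == 8' "$dir/file" > "$dir/file2"
cat "$dir/file2"
```

Unlike wc -c, length($0) does not count the line ending, so no subtraction is needed.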
A totally different approach would be filtering out the wrong lines with sed, awk or grep, but that would require knowing which characters you will and won't accept.
Yes, when you are a lucky (wo-)man, all ugly lines will have a character in common like '<' or maybe a '#'. In that case you can use egrep:
egrep -v "<|#" file > file2
Based on inspection of the screenshot:
sed -r 's/<[[:alnum:]]{2}>//g;s/\^.//g;s/ESC\^*C*//g' file
To make the changes in the file itself and keep a backup with a .bak extension, run:
sed -r -i.bak 's/<[[:alnum:]]{2}>//g;s/\^.//g;s/ESC\^*C*//g' file
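One caveat, offered as an assumption since only the screenshot is available: sequences such as <A0>, ^A or ESC are usually how less displays non-printable bytes, so the file itself may contain the raw bytes rather than those literal characters. If that is the case here, a sketch like the following locates or drops the lines that contain such bytes. The sample data is invented, and the -a option is assumed to be supported (GNU and BSD grep have it):

```shell
dir=$(mktemp -d)
# Hypothetical sample: the middle line holds two control bytes (octal 001 and 033).
printf 'ACGT\nAC\001\033GT\nTTGA\n' > "$dir/file"

# Print the numbers of lines containing bytes that are neither printable nor whitespace.
# -a makes grep treat the input as text even if it looks binary.
LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$dir/file" | cut -d: -f1

# Write a copy without those lines.
LC_ALL=C grep -a -v '[^[:print:][:space:]]' "$dir/file" > "$dir/file2"
cat "$dir/file2"
```

Keep in mind that a fastq record spans four lines, so dropping single lines may leave incomplete records that downstream tools still reject.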
Original file contains:
B
RBWBW
RWRWWRBWWWBRBWRWWBWWB
My file contains :
B
RBWBW
RWRWWRBWWWBRBWRWWBWWB
However, when I use the command diff original myfile it shows the following:
1,3c1,3
< B
< RBWBW
< RWRWWRBWWWBRBWRWWBWWB
---
> B
> RBWBW
> RWRWWRBWWWBRBWRWWBWWB
When I add the -w flag (diff original myfile -w) it shows no differences... but I'm absolutely sure these two files do not have whitespace/endline differences. What's the problem?
These texts are equal.
Maybe you have extra white spaces.
try
diff -w -B file1.txt file2.txt
-w Ignore all white space.
-B Ignore changes whose lines are all blank.
As seen in the comments, you must have different line endings, because the original file came from a DOS system. That's why -w, which treats the trailing carriage return as white space, made the files match.
To repair the file, execute:
dos2unix file
Look at them in Hex format. This way you can really see if they are the same.
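For instance, od (or xxd, where installed) prints every byte, so a DOS line ending shows up as \r followed by \n. The sketch below uses invented sample files and also shows tr -d '\r' as a fallback for systems without dos2unix:

```shell
dir=$(mktemp -d)
# Hypothetical files: same text, one with DOS (CRLF) and one with Unix (LF) line endings.
printf 'B\r\nRBWBW\r\n' > "$dir/original"
printf 'B\nRBWBW\n'     > "$dir/myfile"

# od -c prints each byte; the carriage returns appear as \r before \n.
od -c "$dir/original"

# Count the lines that contain a carriage return.
cr=$(printf '\r')
grep -c "$cr" "$dir/original" || true

# Strip every carriage return, then compare byte for byte.
tr -d '\r' < "$dir/original" > "$dir/original.unix"
cmp "$dir/original.unix" "$dir/myfile" && echo "identical after conversion"
```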
Firstly, which is the best and fastest Unix command to get only the differences between two files? I tried using diff to do it (below).
I tried the answer given by Neilvert Noval over here - Compare two files line by line and generate the difference in another file
code -
diff -a --suppress-common-lines -y file1.txt file2.txt >> file3.txt
But I get a lot of spaces and also a > symbol before the differing lines. How do I fix that? I was thinking of removing the spaces and the first '>', but I am not sure that is a neat fix.
My file1.txt has -
Hello World!
Its such a nice day!
#this is a newline and not a line of text#
My file2.txt has -
Hello World!
Its such a nice day!
Glad to be here!
#this is a newline and not a line of text#
Output - " #Many spaces here# > Glad to be here:)"
Expected output - Glad to be here:)
Another way to get diff is by using awk:
awk 'FNR==NR{a[$0];next}!($0 in a)' file1 file2
Though I must admit that I haven't run any benchmarks and can't say which is the fastest solution.
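To unpack that one-liner: while awk reads the first file (FNR==NR is only true there), each line is stored as an array key; for the second file, only lines that were never stored are printed. A commented sketch using the sample data from the question (the array is renamed to seen for readability):

```shell
dir=$(mktemp -d)
printf 'Hello World!\nIts such a nice day!\n\n' > "$dir/file1.txt"
printf 'Hello World!\nIts such a nice day!\nGlad to be here!\n\n' > "$dir/file2.txt"

awk '
    FNR == NR { seen[$0]; next }   # first file: remember every line
    !($0 in seen)                  # second file: print lines not seen before
' "$dir/file1.txt" "$dir/file2.txt" > "$dir/file3.txt"
cat "$dir/file3.txt"
```

Note that this ignores line order and duplicates: it only reports lines of the second file that appear nowhere in the first.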
The -y option to diff makes it produce a "side by side" diff, which is why you have the spaces. Try -U 0 instead, for the unified format with zero lines of context. Apart from the ---, +++ and @@ header lines, that should print:
+Glad to be here:)
The plus means the line was added, whereas a minus means it was removed.
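If only the text of the added lines is wanted, the unified output can be post-processed. A sketch, assuming a diff that supports -U 0 and the sample files from the question:

```shell
dir=$(mktemp -d)
printf 'Hello World!\nIts such a nice day!\n\n' > "$dir/file1.txt"
printf 'Hello World!\nIts such a nice day!\nGlad to be here!\n\n' > "$dir/file2.txt"

# -U 0: unified format, zero context lines.
# || true: diff exits with status 1 when the files differ, which is expected here.
# tail -n +3: drop the "--- file1" and "+++ file2" header lines.
# sed: print only added lines, with the leading + removed.
{ diff -U 0 "$dir/file1.txt" "$dir/file2.txt" || true; } \
    | tail -n +3 | sed -n 's/^+//p' > "$dir/file3.txt"
cat "$dir/file3.txt"
```

Skipping the headers with tail rather than a pattern avoids accidentally dropping an added line that itself begins with "++".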
diff -a --suppress-common-lines -y file1.txt file2.txt | sed 's/^[[:space:]]*>[[:space:]]*//' >> file3.txt
I am still new to Unix, but I am eager to learn it.
I have 2 files. Some lines share matching substrings, and I would like to concatenate those lines into one line, leaving the others untouched. Here is an example:
File 1 (fasta file):
>292183
AGAGTTTGATCCTGGCTCAGGATGAACGCTAGCGACAGGCTTAACACATGCAAGTCGAGGGGCAGCGGGGAGGAAGCTTGCTTTCTCTGCCGGCGACCGG CGCACGGGTGAGT
>551166
GTCGAGCGGCGAACGGGTGAGTAACGCGTGGATTATCTGCCCCGAGGTGGGGGATAACCCGGGGAAACTCGGGCTAATACCGCATATGACCGTGAGGTCA AAGGGGGGTCGCA
File 2:
292183 k__Bacteria
551166 k__Bacteria; p__Acidobacteria
The desired output:
>292183 k__Bacteria
AGAGTTTGATCCTGGCTCAGGATGAACGCTAGCGACAGGCTTAACACATGCAAGTCGAGGGGCAGCGGGGAGGAAGCTTGCTTTCTCTGCCGGCGACCGG CGCACGGGTGAGT
>551166 k__Bacteria; p__Acidobacteria
GTCGAGCGGCGAACGGGTGAGTAACGCGTGGATTATCTGCCCCGAGGTGGGGGATAACCCGGGGAAACTCGGGCTAATACCGCATATGACCGTGAGGTCA AAGGGGGGTCGCA
I tried to use awk and perl for that, but I never managed to get them into one file.
I appreciate any help,
Best Regards,
M
sed 's/\([0-9]*\).*/s.\1.&./' File_2 | sed -f- File_1
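The one-liner above turns each line of File_2 into a sed substitution command and applies that generated script to File_1. As an alternative sketch, an awk lookup table does the same join and only touches header lines. The shortened sequences below are placeholders, not the real data, and the output name merged.fasta is made up:

```shell
dir=$(mktemp -d)
printf '>292183\nAGAGTTTGATCC\n>551166\nGTCGAGCGGCGA\n' > "$dir/File_1"
printf '292183 k__Bacteria\n551166 k__Bacteria; p__Acidobacteria\n' > "$dir/File_2"

awk '
    NR == FNR {                       # first file (File_2): build id -> label table
        label = $0
        sub(/^[^ ]+ */, "", label)    # drop the id and the spaces after it
        tax[$1] = label
        next
    }
    /^>/ {                            # header line in File_1
        id = substr($0, 2)
        if ((id in tax) && tax[id] != "") $0 = $0 " " tax[id]
    }
    { print }
' "$dir/File_2" "$dir/File_1" > "$dir/merged.fasta"
cat "$dir/merged.fasta"
```

Because ids are compared as whole header strings, an id that happens to be a substring of another id is not matched by mistake.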
diff has an option -I regexp, which ignores changes that just insert or delete lines that match the given regexp. I need an analogue of this for the case when the change is between two lines (rather than inserted or deleted lines).
For instance, I want to ignore all differences like the one between "abXd" and "abYd", for given X and Y.
It seems diff has no such ability. Is there a suitable alternative to diff?
You could filter the two files through sed to eliminate the lines you don't care about. The general pattern is /regex1/,/regex2/ d to delete anything between lines matching two regexes. For example:
diff <(sed '/abXd/,/abYd/d' file1) <(sed '/abXd/,/abYd/d' file2)
Improving upon the earlier solution by John Kugelman:
diff <(sed 's/ab[XY]d/abd/g' file1) <(sed 's/ab[XY]d/abd/g' file2)
is probably what you may be looking for! This version normalizes the specific change on each line without deleting the line itself. This allows diff to show any other differences that remain on the line.
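The <( ... ) syntax is process substitution, which bash, zsh and ksh support but plain POSIX sh does not. A sketch of the same normalize-then-diff idea with temporary files, using invented sample content and assuming X and Y are single characters:

```shell
dir=$(mktemp -d)
# Hypothetical inputs: line 1 differs only by X/Y, line 2 has a real difference.
printf 'abXd\nfoo\n' > "$dir/file1"
printf 'abYd\nbar\n' > "$dir/file2"

# Normalize the difference we want to ignore, then diff the normalized copies.
sed 's/ab[XY]d/abd/g' "$dir/file1" > "$dir/file1.norm"
sed 's/ab[XY]d/abd/g' "$dir/file2" > "$dir/file2.norm"

# Only the foo/bar change is reported; || true absorbs diff's exit status 1.
diff "$dir/file1.norm" "$dir/file2.norm" > "$dir/result" || true
cat "$dir/result"
```

The reported line numbers still match the original files, since the substitution never adds or removes lines.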
Assuming X and Y are single characters, then -I 'ab[XY]d' works fine for me.
You could use sed to replace instances of the pattern with a standard string:
diff <(sed 's/ab[XY]d/ab__REPLACED__d/g' file1) <(sed 's/ab[XY]d/ab__REPLACED__d/g' file2)
My open-source Linux tool 'dif' compares files while ignoring various differences.
It has many options for doing search/replace, ignoring whitespace, comments, or timestamps, sorting the input files, ignoring certain lines, etc.
After preprocessing the input files, it runs the Linux tools meld, gvimdiff, tkdiff, diff, or kompare on these intermediate files.
Installation is not required, just download and run the 'dif' executable from https://github.com/koknat/dif
For your use case, try the search and replace option:
./dif file1 file2 -search 'ab[XY]d' -replace 'abd' -diff