Find difference line by line - linux

I have a program which stores some data in two files stored in separate folders. /Path_1/File A and /Path_2/File B.
Now I need to compare those two files line by line for any differences. If any difference noted, I need to capture that and stored in a separate file or print it on the screen.
I tried using comm,diff and join. But none of them worked so far. Appreciate any help.
Sample file looks like following.
124 days
3.10.0-327.13.1.el7.x86_64
/dev/mapper/vg_sda-lv_root ext4
devtmpfs devtmpfs
In other file number of days and kernel version can be differ. I only need to capture that while running a script.
I tried diff -y -W 120 Source/File Destination/File , comm File1 File2

You can try this:
diff --suppress-common-lines /path_1/file_a /path_2/file_b > output
Where --suppress-common-lines does:
"do not output common lines"

Related

Is it possible to display a file's contents and delete that file in the same command?

I'm trying to display the output of an AWS lambda that is being captured in a temporary text file, and I want to remove that file as I display its contents. Right now I'm doing:
... && cat output.json && rm output.json
Is there a clever way to combine those last two commands into one command? My goal is to make the full combined command string as short as possible.
For cases where
it is possible to control the name of the temporary text file.
If file is not used by other code
Possible to pass "/dev/stdout" as the.name of the output
Regarding portability: see stack exchange how portable ... /dev/stdout
POSIX 7 says they are extensions.
Base Definitions,
Section 2.1.1 Requirements:
The system may provide non-standard extensions. These are features not required by POSIX.1-2008 and may include, but are not limited to:
[...]
• Additional character special files with special properties (for example,  /dev/stdin, /dev/stdout,  and  /dev/stderr)
Using the mandatory supported /dev/tty will force output into “current” terminal, making it impossible to pipe the output of the whole command into different program (or log file), or to use the program when there is no connected terminals (cron job, or other automation tools)
No, you cannot easily remove the lines of a file while displaying them. It would be highly inefficient as it would require removing characters from the beginning of a file each time you read a line. Current filesystems are pretty good at truncating lines at the end of a file, but not at the beginning.
A simple but extremely slow method would look like this:
while [ -s output.json ]
do
head -1 output.json
sed -i 1d output.json
done
While this algorithm is plain and simple, you should know that each time you remove the first line with sed -i 1d it will copy the whole content of the file but the first line into a temporary file, resulting in approximately 0.5*n² lines written in total (where n is the number of lines in your file).
In theory you could avoid this by do something like that:
while [ -s output.json ]
do
line=$(head -1 output.json)
printf -- '%s\n' "$line"
fallocate -c -o 0 -l $((${#len}+1)) output.json
done
But this does not account for variable newline characters (namely DOS-formatted newlines) and fallocate does not always work on xfs, among other issues.
Since you are trying to consume a file alongside its creation without leaving a trace of its existence on disk, you are essentially asking for a pipe functionality. In my opinion you should look into how your output.json file is produced and hopefully you can pipe it to a script of your own.

Is it possible to partially unzip a .vcf file?

I have a ~300 GB zipped vcf file (.vcf.gz) which contains the genomes of about 700 dogs. I am only interested in a few of these dogs and I do not have enough space to unzip the whole file at this time, although I am in the process of getting a computer to do this. Is it possible to unzip only parts of the file to begin testing my scripts?
I am trying to a specific SNP at a position on a subset of the samples. I have tried using bcftools to no avail: (If anyone can identify what went wrong with that I would also really appreciate it. I created an empty file for the output (722g.990.SNP.INDEL.chrAll.vcf.bgz) but it returns the following error)
bcftools view -f PASS --threads 8 -r chr9:55252802-55252810 -o 722g.990.SNP.INDEL.chrAll.vcf.gz -O z 722g.990.SNP.INDEL.chrAll.vcf.bgz
The output type "722g.990.SNP.INDEL.chrAll.vcf.bgz" not recognised
I am planning on trying awk, but need to unzip the file first. Is it possible to partially unzip it so I can try this?
Double check your command line for bcftools view.
The error message 'The output type "something" is not recognized' is printed by bcftools when you specify an invalid value for the -O (upper-case O) command line option like this -O something. Based on the error message you are getting it seems that you might have put the file name there.
Check that you don't have your input and output file names the wrong way around in your command. Note that the -o (lower-case o) command line option specifies the output file name, and the file name at the end of the command line is the input file name.
Also, you write that you created an empty file for the output. You don't need to do that, bcftools will create the output file.
I don't have that much experience with bcftools but generically If you want to to use awk to manipulate a gzipped file you can pipe to it so as to only unzip the file as needed, you can also pipe the result directly through gzip so it too is compressed e.g.
gzip -cd largeFile.vcf.gz | awk '{ <some awk> }' | gzip -c > newfile.txt.gz
Also zcat is an alias for gzip -cd, -c is input/output to standard out, -d is decompress.
As a side note if you are trying to perform operations on just a part of a large file you may also find the excellent tool less useful it can be used to view your large file loading only the needed parts, the -S option is particularly useful for wide formats with many columns as it stops line wrapping, as is -N for showing line numbers.
less -S largefile.vcf.gz
quit the view with q and g takes you to the top of the file.

How to use sed command to delete lines without backup file?

I have large file with size of 130GB.
# ls -lrth
-rw-------. 1 root root 129G Apr 20 04:25 syslog.log
So I need to reduce file size by deleting line which starts with "Nov 2" , So I have given the following command,
sed -i '/Nov 2/d' syslog.log
So I can't edit file using VIM editor also.
When I trigger SED command , its creating backup file also. But I don't have much space in root. Please try to give alternate solution to delete particular line from this file without increasing space in server.
It does not create a real backup file. sed is a stream editor. When applied to a file with option -i it will stream that file through the sed process, write the output to a new file (a temporary one), when everything is done, it will rename the new file to the original name.
(There are options to create backup files also, but you didn't give them, so I won't mention that further.)
In your case you have a very large file and don't want to create any copy, however temporary. For this you need to open the file for reading and writing at the same time, then your sed process can overwrite the original. After this, you will have to truncate the file at the end of the writing.
To demonstrate how this can be done, we first perform a test case.
Create a test file, containing lots of lines:
seq 0 999999 > x
Now, lets say we want to remove all lines containing the digit 4:
grep -v 4 1<>x <x
This will open the file for reading and writing as STDOUT (1), and for reading as STDIN. The grep command will read all lines and will output only the lines not containing a 4 (option -v).
This will effectively overwrite the beginning of the original file.
You will not know how long the output is, so after the output the original contents of the file will appear:
…
999991
999992
999993
999995
999996
999997
999998
999999
537824
537825
537826
537827
537828
537829
…
You can use the Unix tool truncate to shorten your file manually afterwards. In a real scenario you will have trouble finding the right spot for this, so it makes sense to count the number of bytes written (using wc):
(Don't forget to recreate the original x for this test.)
(grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c
This will preform the step above and additionally print out the number of bytes written to the terminal, in this example case the output will be 3653658. Now use truncate:
truncate -s 3653658 x
Now you have the result you want.
If you want to do this in a script, i. e. without interaction, you can use this:
length=$((grep -v 4 <x | tee /dev/stderr 1<>x) |& wc -c)
truncate -s "$length" x
I cannot guarantee that this will work for files >2GB or >4GB on your machine; depending on your operating system (32bit?) and the versions of the installed tools you might run into largefile issues. I'd perform tests with large files first (>4GB as this is typically a limit for many things) and then cross your fingers and give it a try :)
Some caveats you have to keep in mind:
Of course, nobody is supposed to append log entries to that log file while the procedure is running.
Also, any abort during the running of the process (power failure, signal caught, etc.) will leave the file in an undefined state. But re-running the command again after such a mishap will in most cases produce the correct output; some lines might be doubled, but not more than a single line should be corrupted then.
The output must be smaller than the input, of course, otherwise the writing will overtake the reading, corrupting the whole result so that lines which should be there will be missing (or truncated at the start).

Manually merge two files using diff

I'd like to merge two files by doing the following:
Output the diff of the two files into a temp file and
Manually select the lines I want to copy/save.
The problem here is that diff -u only gives me a file lines of context, while I want to output the entire file in a unified format.
Is there any way diff can do this?
One option that might fit the bill for you,
sdiff : side-by-side diff of files.
sdiff -o merged.file left.file right.file
Once there, it will prompt you with what lines you want to keep from which file. Hit ? and then enter for a little help. Also man sdiff with the detailed goods.
(In my distro, these come packaged in the "diffutils" package [fedora,centos])
If you need to automate the process, you might want to try the util merge, which will mark conflicts in the files. However, that might put you back at square one.
"I want to output the entire file in a unified format. Is there any way diff can do this?"
Yes.
diff -U 9999999 file1.txt file2.txt > diff.txt
This should work, provided your files are less than 10 million lines long.
You can merge/combine the two files with diff using --
diff --line-format %L file1 file2
The easy answer is to use the -D flag to merge the files and surround the differences with C style #ifdef statements.
From the documentation:
-D NAME --ifdef=NAME
Output merged file to show `#ifdef NAME' diffs.
You can use it as follows:
$ diff -D NEWSTUFF file1 file2 > merged_file
I usually then just open the merged file in an editor and resolve the merge conflicts by hand.
You also can use options to output an ed script, etc.
If you are an emacs user, you can do this directly in emacs using the "emerge" tool:
https://www.gnu.org/software/emacs/manual/html_node/emacs/Emerge.html
Issuing M-x emerge-files will open an interactive prompt with a view of files A, B, and the merged file to allow choosing text that differs between files A & B, inserting part of A into B, and more.

compare two files and save the difference in linux

I like to compare two text files and save the difference under linux.
I know there are tools like kdiff, diff vimdiff etc. but my expectation are as follows.
Output should be in a separate file
The difference should be quoted with colours, ex: delete line in red and added line in green something like that
It should ignore space differences
It should be an opensource tool
use tkdiff4 -w file-name1 file-name2
It fulfills all your requirements. Specific color might be an issue.
try colordiff and man diff for options for ignoring whitespace etc
Like,
#!/bin/bash
wdiff -w "\e[31m" -x "\e[0m" -y "\e[32m" -z "\e[0m" "$#";
replace \e by, well, the ASCII character with value 0x1A. Put the two commands into some file, and run it using redirection.
Save the changes to a file:
diff -Nur originalfile newfile > patchfile
Use the difference file to change the origin file:
patch originfile patchfile
I think this is the easiest way to save the changes and reload the changes.
By the way, you can use this command the create an update-package.

Resources