Is there a popular Linux/Unix format for binary diffs?

I'm going to be producing binary deltas of multi-gigabyte files.
Naively, I'm intending to use the following format:
struct chunk {
    uint64_t offset;
    uint64_t length;
    uint8_t  data[];
};

struct delta {
    uint8_t file_a_checksum[32];  // These are calculated while the
    uint8_t file_b_checksum[32];  // gzipped chunks are being written
    uint8_t chunks_checksum[32];  // at the 96 octet offset.
    uint8_t gzipped_chunks[];
};
I only need to apply these deltas to the original file_a that was used to generate a delta.
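Applying a delta would then just be a loop over the chunks, along these lines (a rough sketch, error handling omitted, and assuming gzipped_chunks has already been decompressed into a plain stream of chunk records):
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Patch the target file in place from a decompressed stream of chunks. */
static void apply_chunks(FILE *target, FILE *chunks)
{
    uint64_t offset, length;

    while (fread(&offset, sizeof offset, 1, chunks) == 1 &&
           fread(&length, sizeof length, 1, chunks) == 1) {
        uint8_t *buf = malloc(length);
        fread(buf, 1, length, chunks);            /* chunk payload */
        fseeko(target, (off_t)offset, SEEK_SET);  /* 64-bit seek for multi-GB files */
        fwrite(buf, 1, length, target);           /* overwrite in place */
        free(buf);
    }
}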
Is there anything I'm missing here?
Is there an existing binary delta format which has the features I'm looking for, yet isn't too much more complex?

For arbitrary binaries, of course it makes sense to use a general purpose tool:
xdelta
bspatch
rdiff-backup (rsync)
git diff
(Yes, git diff works on files that aren't under version control: git diff --binary --no-index dir1/file.bin dir2/file.bin)
I would usually recommend a generic tool before writing your own, even if there is a little overhead. While none of the tools in the above list produce binary diffs in a format quite as ubiquitous as the "unified diff" format, they are all "close to" standard tools.
There is one other fairly standardised format that might be relevant for you: the humble hexdump. The xxd tool dumps binaries into a fairly standard text format by default:
0000050: 2020 2020 5858 4428 3129 0a0a 0a0a 4e08 XXD(1)....N.
That is, offset followed by a series of byte values. The exact format is flexible and configurable with command-line switches.
However, xxd can also be used in reverse mode to write those bytes instead of dumping them.
So if you have a file called patch.hexdump:
00000aa: bbccdd
Then running xxd -r patch.hexdump my.binary will patch the file my.binary, replacing three bytes at offset 0xaa.
Finally, I should also mention that dd can seek into a binary file and read/write a given number of bytes, so I guess you could use "shell script with dd commands" as your patch format.

Related

Create a file of given size in Linux and fill it with user data pattern

I need to generate a file of a certain size, such as 4 KB, 128 KB, etc.
What is a command to create a file of a certain size on Linux? I vaguely remember that the dd tool serves this purpose, as in
dd if=/dev/zero of=upload_test bs=file_size count=1.
I also need to fill the created file with one of these patterns:
1) All zeroes
2) An incrementing one-byte pattern: 0x00 0x01 0x02 ... 0xFF 0x00 0x01 ... 0xFF ...
3) All ones / any fixed value
Is there a command line tool or script that can serve both purposes, i.e. create a file of a certain size and fill it with a pattern?
perl is almost certainly the right tool for this. My perl is (sadly) a bit rusty, and there are certainly better ways to do this, but a simple approach (for your option 2) would be something like:
$ count=2048; perl -e 'print pack("C", $x++ % 256) foreach (1..'$count')'
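If you'd rather have a tiny compiled tool, the same idea is a few lines of C; here is a sketch (the output name and size are made up):
#include <stdio.h>

int main(void)
{
    const long size = 128 * 1024;          /* desired file size (128 KB) */
    FILE *f = fopen("upload_test", "wb");  /* output file name */

    for (long i = 0; i < size; i++)
        fputc(i % 256, f);                 /* 0x00 0x01 ... 0xFF, repeating */

    fclose(f);
    return 0;
}
For patterns 1 and 3, replace i % 256 with 0x00 or whatever fixed value you need.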

How do I seek for holes and data in a sparse file in golang [duplicate]

I want to copy files from one place to another and the problem is I deal with a lot of sparse files.
Is there any (easy) way of copying sparse files without them becoming huge at the destination?
My basic code:
out, err := os.Create(bricks[0] + "/" + fileName)
in, err := os.Open(event.Name)
io.Copy(out, in)
Some background theory
Note that io.Copy() pipes raw bytes – which is sort of understandable once you consider that it pipes data from an io.Reader to an io.Writer, which provide Read([]byte) and Write([]byte), respectively.
As such, io.Copy() is able to deal with absolutely any source providing bytes and absolutely any sink consuming them.
On the other hand, the location of the holes in a file is "side-channel" information which "classic" syscalls such as read(2) hide from their users.
io.Copy() is not able to convey such side-channel information in any way.
In other words, file sparseness was originally just a way to store data efficiently behind the user's back.
So, no, there's no way io.Copy() could deal with sparse files by itself.
What to do about it
You'd need to go one level deeper and implement all this using the syscall package and some manual tinkering.
To work with holes, you should use the SEEK_HOLE and SEEK_DATA special values for the lseek(2) syscall, which, while formally non-standard, are supported by all major platforms.
Unfortunately, support for those "whence" values is present neither in the stock syscall package (as of Go 1.8.1) nor in the golang.org/x/sys tree.
But fear not, there are two easy steps:
First, the stock syscall.Seek() is actually mapped to lseek(2)
on the relevant platforms.
Next, you'd need to figure out the correct values for SEEK_HOLE and
SEEK_DATA for the platforms you need to support.
Note that they are free to be different between different platforms!
Say, on my Linux system I can run a simple
$ grep -E 'SEEK_(HOLE|DATA)' </usr/include/unistd.h
# define SEEK_DATA 3 /* Seek to next data. */
# define SEEK_HOLE 4 /* Seek to next hole. */
…to figure out the values for these symbols.
Now, say, you create a Linux-specific file in your package
containing something like
// +build linux

const (
    SEEK_DATA = 3
    SEEK_HOLE = 4
)
and then use these values with syscall.Seek().
The file descriptor to pass to syscall.Seek() and friends
can be obtained from an opened file using the Fd() method
of os.File values.
The pattern to use when reading is to detect regions containing data, and read the data from them – see this for one example.
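To make that pattern concrete, here is the core loop sketched in C (it maps one-to-one onto syscall.Seek() in Go, using the SEEK_DATA and SEEK_HOLE values found above):
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3   /* as found in unistd.h above */
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

int main(int argc, char **argv)
{
    int fd = open(argv[1], O_RDONLY);
    off_t end = lseek(fd, 0, SEEK_END);
    off_t pos = 0;

    while (pos < end) {
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0)   /* ENXIO: nothing but a trailing hole left */
            break;
        off_t hole = lseek(fd, data, SEEK_HOLE);  /* end of this data region */
        printf("data region: %lld..%lld\n", (long long)data, (long long)hole);
        /* read and copy the bytes in [data, hole) here */
        pos = hole;
    }
    close(fd);
    return 0;
}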
Note that this deals with reading sparse files; if you want to actually transfer them as sparse – that is, preserving this property – the situation is more complicated: it appears to be even less portable, so some research and experimentation is due.
On Linux, it appears you could try to use fallocate(2) with FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE to try to punch a hole at the end of the file you're writing to; if that legitimately fails (with syscall.EOPNOTSUPP), you just shovel as many zeroed blocks into the destination file as are covered by the hole you're reading, in the hope that the OS will do the right thing and convert them to a hole by itself.
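A hedged sketch of that punch-or-zero-fill logic, in C for brevity (in Go you would reach for syscall.Fallocate on Linux):
#define _GNU_SOURCE   /* for fallocate() and the FALLOC_FL_* flags */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Reproduce a hole of the given length at the given offset in fd. */
static int write_hole(int fd, off_t off, off_t len)
{
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len) == 0)
        return 0;                 /* a real hole was punched */
    if (errno != EOPNOTSUPP)
        return -1;                /* a genuine error */

    /* The filesystem can't punch holes: shovel zeroed blocks instead,
       hoping the OS converts them back into a hole. */
    static const char zeros[4096];
    while (len > 0) {
        size_t n = len < (off_t)sizeof zeros ? (size_t)len : sizeof zeros;
        ssize_t w = pwrite(fd, zeros, n, off);
        if (w < 0)
            return -1;
        off += w;
        len -= w;
    }
    return 0;
}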
Note that some filesystems do not support holes at all – as a concept.
One example is the filesystems in the FAT family.
What I'm leading you to is that the inability to create a sparse file might actually be a property of the target filesystem in your case.
You might find Go issue #13548 "archive/tar: add support for writing tar containing sparse files" to be of interest.
One more note: you might also consider checking whether the destination directory resides on the same filesystem as the source file, and if it does, use syscall.Rename() (on POSIX systems) or os.Rename() to just move the file across directories without actually copying its data.
You don't need to resort to syscalls.
package main

import "os"

func main() {
    f, _ := os.Create("/tmp/sparse.dat")
    f.Write([]byte("start"))
    f.Seek(1024*1024*10, 0)
    f.Write([]byte("end"))
}
Then you'll see:
$ ls -l /tmp/sparse.dat
-rw-rw-r-- 1 soren soren 10485763 Jun 25 14:29 /tmp/sparse.dat
$ du /tmp/sparse.dat
8 /tmp/sparse.dat
It's true you can't use io.Copy as is. Instead you need to implement an alternative to io.Copy which reads a chunk from the src, checks if it's all '\0'. If it is, just dst.Seek(len(chunk), os.SEEK_CUR) to skip past that part in dst. That particular implementation is left as an exercise to the reader :)
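Still, for the flavor of it, the core of such a loop might look like this (sketched in C with already-opened file descriptors src and dst; the Go version with dst.Seek() and dst.Write() is analogous):
#include <string.h>
#include <unistd.h>

/* src and dst are assumed to be open file descriptors. */
char buf[64 * 1024];
ssize_t n;

while ((n = read(src, buf, sizeof buf)) > 0) {
    /* The chunk is all zeroes iff its first byte is 0 and every
       byte equals the one before it. */
    if (buf[0] == 0 && memcmp(buf, buf + 1, n - 1) == 0)
        lseek(dst, n, SEEK_CUR);   /* skip: leaves a hole in dst */
    else
        write(dst, buf, n);
}
ftruncate(dst, lseek(dst, 0, SEEK_CUR));  /* fix the size if the file ends in a hole */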

looking for fast way to edit a large file in linux

I have a large file, several gig of binary data, with an ASCII header at the top. I need to make a few small changes to the ASCII header. sed does the job, but it takes a fair bit of time since the file is so large. vi/vim is slow too. Is there any Linux utility that can just go into the file, make the change at the top, and then get out quickly?
An example might be a header that looks like:
Code Rev: 3.5
Platform: platform1
Run Date: 12/13/16
Data source: whatever
Restart: False
followed by a large amount of binary data ....
and then I might need to, for example, edit an error in "Data source".
Provided that you know that your header is less than X bytes, you can use dd.
(!) But it only works if both strings have the same length (!)
Let's say the header is less than 4096 bytes:
dd if=/path/to/file bs=4096 count=1 | sed 's/XXX/YYY/' | dd of=/path/to/file conv=notrunc
You can also do it programmatically, using a language like C, Python, PHP, or Java. The idea is to open the file, read the header, fix it, and write it back.
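For instance, a minimal C sketch of that idea (the file name is hypothetical; as with the dd approach, the replacement must be exactly as long as the original):
#include <stdio.h>
#include <string.h>

int main(void)
{
    char header[4096];                  /* assume the header fits in 4096 bytes */
    FILE *f = fopen("big.dat", "r+b");  /* hypothetical file name */
    size_t n = fread(header, 1, sizeof header - 1, f);
    header[n] = '\0';

    /* strstr() stops at the first NUL byte, so it effectively
       searches only the ASCII header, not the binary data. */
    char *p = strstr(header, "platform1");
    if (p != NULL) {
        fseek(f, p - header, SEEK_SET);
        fwrite("platform2", 1, strlen("platform2"), f);  /* same length! */
    }
    fclose(f);
    return 0;
}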

(open + write) vs. (fopen + fwrite) to kernel /proc/

I have a very strange bug. If I do:
int fd = open("/proc/...", O_WRONLY);
write(fd, argv[1], strlen(argv[1]));
close(fd);
everything works, including for a very long string with length > 1024.
If I do:
FILE *fd = fopen("/proc/...", "wb");
fwrite(argv[1], 1, strlen(argv[1]), fd);
fclose(fd);
the string is cut at around 1024 characters.
I'm running an ARM embedded device with a 3.4 kernel. I have debugged in the kernel and I see that the string is already cut when I reach the very early function vfs_write (I spotted this function with a WARN_ON instruction to get the stack).
The problem is the same with fputs vs. puts.
I can use fwrite for a very long string (>1024) if I write to a standard rootfs file, so the problem is really linked to how the kernel handles /proc.
Any idea what's going on?
The problem is probably with buffering.
The issue is that special files, such as those under /proc, are, well... special: they are not always a simple stream of bytes, and may have to be written to (or read from) with specific sizes and/or offsets. You do not say which file you are writing to, so it is impossible to be sure.
Then, fwrite() assumes the output is a simple stream of bytes, so it does smart fancy things, such as buffering, splitting, and copying the given data. On a regular file that just works, but on a special file, funny things may happen.
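If buffering is the culprit, one quick experiment (a sketch, not a guaranteed fix) is to give stdio a buffer larger than the string, so the whole payload reaches the kernel in a single write(2):
FILE *fd = fopen("/proc/...", "wb");
setvbuf(fd, NULL, _IOFBF, 64 * 1024);     /* buffer larger than the payload */
fwrite(argv[1], 1, strlen(argv[1]), fd);
fclose(fd);                               /* flushes everything in one write(2) */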
Just to be sure, try to run strace with both versions of your program and compare the outputs. If you wish, post them for additional comments.

Writing number with a lot of digits into file using MPIR

I'm using MPIR on Ubuntu 14.04.
I have a big integer with a lot of digits, like 2^1920, and I don't know how to write it into a .txt file.
FILE *result;
result=fopen("Number.txt","a+");
gmp_fprintf(result,"%d",xyz);
fclose(result);
didn't work.
Are there some other options I can use?
The gmp_printf() function (and thus gmp_fprintf() as well) requires a special format specifier for an mpz_t object (which I guess xyz is). You should use %Zd instead of plain %d, which does not work; to be pedantic, using an inadequate format specifier is undefined behavior in C.
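With that fix, the original snippet becomes (assuming xyz is an mpz_t):
FILE *result = fopen("Number.txt", "a+");
gmp_fprintf(result, "%Zd", xyz);   /* %Zd is the conversion for mpz_t */
fclose(result);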
If you don't need "full-featured" formatted output, you might also take a look at mpz_out_str(), which allows you to specify a base (like 2 or 10):
size_t mpz_out_str (FILE *stream, int base, const mpz_t op)
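Used with the same FILE pointer as above, that would be, e.g.:
mpz_out_str(result, 10, xyz);   /* write xyz in base 10 to result */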
Alternatively, you might use the mpz_out_raw() function, which just "dumps" the whole number in the binary format it is stored in:
size_t mpz_out_raw (FILE *stream, const mpz_t op)
Output op on stdio stream stream, in raw binary format. The integer is
written in a portable format, with 4 bytes of size information, and
that many bytes of limbs. Both the size and the limbs are written in
decreasing significance order (i.e., in big-endian).
