Potential dangers of merging two files of unknown size? - visual-c++

I have a binary file that I need to insert a header at the beginning of. I was thinking of opening a new file, writing the header data, and then copying the data from the binary file to this new file. Since the binary file is about 1 megabyte, are there any dangers to making this file using fwrite? One specific concern would be something like unintentionally overwriting data, similar to what happens if using gets and the input is longer than the buffer.

There's no risk. Allocate a buffer of a given size, read that many bytes into it from the source file, then write the buffer back out to the destination file. The file read/write operations all take a maximum number of bytes, so as long as you pass the buffer's real size, it won't be overrun.
Also, the approach you describe is pretty much the only way to do it. I've never heard of a filesystem that has an "insert these bytes at the beginning of this file" operation.
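For illustration, here is a minimal C sketch of that buffered copy; the file names, the header contents, and the 64 KB buffer size are placeholders, and error handling is kept to a minimum:
#include <stdio.h>

int main(void)
{
    FILE *src = fopen("input.bin", "rb");    /* placeholder names */
    FILE *dst = fopen("output.bin", "wb");
    if (!src || !dst) { perror("fopen"); return 1; }

    const char header[] = "MYHEADERv1";      /* whatever your header actually is */
    fwrite(header, 1, sizeof header - 1, dst);

    char buf[64 * 1024];                     /* fixed-size copy buffer */
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, src)) > 0)
        fwrite(buf, 1, n, dst);              /* write only the bytes actually read */

    fclose(src);
    fclose(dst);
    return 0;
}
Because fread reports how many bytes it actually read and fwrite is told to write exactly that many, the buffer cannot be overrun no matter how large the source file is.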

Related

Read/write a big file randomly - mmap on every read/write?

Let's say I have a big file, 1 GB. I want to READ 10 KB at offset 10, then WRITE 645 KB at offset 235689, then READ 150 MB at offset 648975, and so on...
What is the best approach between these two:
Opening the file and mmap-ing it (which size?). Then do the reads/writes. At the end unmap and close it.
Or opening the file. On reads/writes, mmap-ing the file (which size?) and then unmapping it. At the end close the file.
Doing mmap(2) on every I/O doesn't sound like the right thing - it would confuse the reader of the code and possibly defeat the kernel's optimizations, and it has no benefit.
You can use pread(2)/pwrite(2), or preadv(2)/pwritev(2), if you want to be explicit about your reads and writes.
If not, you can mmap(2) the entire file (but be sure to use the right flags, probably MAP_SHARED) - Linux won't try to load the entire file into memory anyway.
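As a sketch of the explicit-offset approach, here is what the first two accesses from the question could look like with pread(2)/pwrite(2); the file name is a placeholder and error handling is minimal:
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile", O_RDWR);        /* placeholder name */
    if (fd < 0) { perror("open"); return 1; }

    /* READ 10 KB at offset 10; the offset is passed explicitly,
       so there is no lseek() and no shared file position to track */
    char small[10 * 1024];
    if (pread(fd, small, sizeof small, 10) < 0)
        perror("pread");

    /* WRITE 645 KB at offset 235689 */
    char *chunk = calloc(645, 1024);         /* the data you want to write goes here */
    if (!chunk) return 1;
    if (pwrite(fd, chunk, 645 * 1024, 235689) < 0)
        perror("pwrite");

    free(chunk);
    close(fd);
    return 0;
}
pread/pwrite keep each access independent of the file position, which matches the "read X bytes at offset Y" pattern in the question; with mmap(2) the same accesses would become plain memory reads and writes into a single mapping created once.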

What is a quick way to check if file contents are null?

I have a rather large file (32 GB) which is an image of an SD card, created using dd.
I suspected that the file is empty (i.e. filled with the null byte \x00) starting from a certain point.
I checked this using python in the following way (where f is an open file handle with the cursor at the last position I could find data at):
for i in xrange(512):
    if set(f.read(64*1048576)) != set(['\x00']):
        print i
        break
This worked well (in fact it revealed some data at the very end of the image), but took >9 minutes.
Has anyone got a better way to do this? There must be a much faster way, I'm sure, but I cannot think of one.
Looking at a guide about memory buffers in python here, I suspected that the comparison itself was the issue. In most non-typed languages memory copies are not very obvious, despite being a killer for performance.
In this case, as Oded R. established, reading into a preallocated buffer and comparing the result with a previously prepared null-filled one is much more efficient.
size = 512                # chunk size in bytes; bigger chunks compare faster
data = bytearray(size)    # reusable read buffer
cmp = bytearray(size)     # reference buffer, stays filled with null bytes
And when reading:
f = open(FILENAME, 'rb')
f.readinto(data)          # fills the existing buffer in place, no new objects
blank = (data == cmp)     # True when the chunk is all null bytes
Two things need to be taken into account:
The size of the compared buffers should be equal, but comparing bigger buffers should be faster up to some point (I would expect memory fragmentation to be the main limit).
The last read may not be the same size; reading the file into the prepared buffer will keep the trailing zeroes where we want them.
Here the comparison of the two buffers is quick, there is no attempt to cast the bytes to a string (which we don't need), and since we reuse the same memory the whole time, the garbage collector won't have much work to do either... :)

Squashfs check compressed file size

Is there any way to check the final size of a specific file after compression in a squashfs filesystem?
I'm looking through mksquashfs/unsquashfs command line options but I can't find anything.
Using the -info option in mksquashfs only prints the size before the compression.
Thanks
This isn't feasible to do with much granularity, because compression is done at block level, not file level.
A file may be marked as starting 50 KB into the buffer created by decompressing block 50, and as ending 50 bytes into decompressed block 52 (ignoring fragments here, which are a separate concern) - but that doesn't let you map back to the position inside the compressed copy of block 50 where that file starts. (You can easily determine the compression ratio for block 51, but you can't easily figure out the ratios for the parts of the file contained in blocks 50 and 52 in our example, because those blocks are shared with other contents.)
So the information isn't exposed because it isn't easily available. This actually makes storing numerous (similar) small files significantly more efficient, because a single compression context is used for all of them (and decompressing a block to retrieve one file may mean that files next to it are already decompressed in memory)... but without potentially unfounded assumptions (such as assuming that all contents within a block share that block's average ratio), it doesn't help with working out how well each individual item compressed, because the items aren't compressed individually in the first place.

How does BitTorrent assemble the missing pieces?

I use BitTorrent and sometimes encounter files that have no seeds (missing pieces).
When that happens, I sometimes force the transfer to end and try to open the incomplete file (for example, an image file).
If I am lucky, I may be able to see the downloaded image even though some parts are lost.
I would like to artificially reproduce this situation, and here's how I tried:
1) splitting a BMP image file of about 1 megabyte into 16-kilobyte pieces with the Linux split command,
2) then truncating just one of the pieces to 0 kilobytes,
3) and after that, rejoining all the pieces with the cat command.
However, in this case, unlike the torrent's "lost pieces" situation, the file becomes completely corrupt and cannot be read.
Theoretically it does not seem like anything special, but what's wrong? And how can I achieve what I want?
I would appreciate your help.
Use dd:
dd if=/dev/zero of=image.jpg bs=1 conv=notrunc seek=X count=Y
where X is the offset in the file you want to erase and Y is the number of bytes (for example, seek=32768 count=16384 would blank the third 16-kilobyte piece).
Because conv=notrunc overwrites in place without shortening the file, everything after the blanked region keeps its original offset, which is what a lost piece really looks like. Truncating one 16-kilobyte piece and re-joining with cat instead shifts all later data forward, which is why the result was completely unreadable.
About the corruption, it depends on the type of file, the piece you are losing and the program you are using to read it.
For instance, JPG files use a variable bit-length encoding, meaning that losing just one bit may corrupt the whole file from that point on. Precisely for that reason there can be resynchronization points where the bitstream is reset, so from such a point on the file will look OK again. But those resync points are optional when writing the file, and not every reader honors them in case of corruption...
And anyway, losing part of the headers will make the file totally unreadable.

Changing the head of a large Fortran binary file without dealing with the whole body

I have a large binary file (~ GB size) generated from a Fortran 90 program. I want to modify something in the head part of the file. The structure of the file is very complicated and contains many different variables, which I want to avoid going into. After reading and re-writing the head, is it possible to "copy and paste" the remainder of the file without knowing its detailed structure? Or even better, can I avoid re-writing the whole file altogether and just make changes on the original file? (Not sure if it matters, but the length of the header will be changed.)
Since you are changing the length of the header, I think you have to write a new, revised file. You could avoid having to "understand" the records after the header by opening the file with stream access and just reading bytes (or perhaps four-byte words, if the file is a multiple of four bytes) until you reach EOF and copying them to the new file. But if the file was originally created as sequential access and you want to access it that way in the future, you will have to handle the record length information for the header record(s), including altering the value(s) to be consistent with the changed length of the record(s). This record length information is typically a four-byte integer at the beginning and end of each record, but it depends on the compiler.
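As an illustration only, here is a sketch in C of that byte-level copy, assuming the common compiler convention of a four-byte record-length marker in native byte order before and after each unformatted sequential record, and assuming the header is a single record at the start of the file; the file names and the new header contents are placeholders:
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    FILE *in  = fopen("old.bin", "rb");      /* placeholder names */
    FILE *out = fopen("new.bin", "wb");
    if (!in || !out) { perror("fopen"); return 1; }

    /* Skip the original header record: leading marker, payload, trailing marker. */
    uint32_t oldlen;
    if (fread(&oldlen, sizeof oldlen, 1, in) != 1) return 1;
    fseek(in, (long)oldlen + 4, SEEK_CUR);

    /* Write the revised header record with markers matching its new length. */
    const char newhdr[] = "revised header payload";   /* placeholder content */
    uint32_t newlen = sizeof newhdr - 1;
    fwrite(&newlen, sizeof newlen, 1, out);
    fwrite(newhdr, 1, newlen, out);
    fwrite(&newlen, sizeof newlen, 1, out);

    /* Stream the remaining records through unchanged, without interpreting them. */
    char buf[1 << 16];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, out);

    fclose(in);
    fclose(out);
    return 0;
}
The same byte-for-byte copy can of course be done from Fortran itself by opening both files with access='stream', which is what the answer suggests.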
