How does the bittorrent assemble the missing pieces? - linux

I use BitTorrent and sometimes encounter files that do not have seed(missing pieces).
At that time, we sometimes force the file transfer to end and try to open the incompleted files (for example, an image file).
If we are lucky, may be able to see the downloaded image even if some parts are lost.
I would like to artificially reproduce this situation, and here's how I tried:
1) spliting a bmp image file of about 1 megabyte into 16 kilobytes by the Linux split command,
2) and then make just one of the divided files 0 kilobytes.
3) after that, rejoin all the files with the cat command.
However, in this case, unlike the torrent's "lost pieces" situation, the file becomes completely corrupt and can not be read.
Theoretically it does not seem like anything special, but what's wrong? And how can I achieve what I want?
I would appreciate your help.

Use dd:
dd if=/dev/zero of=image.jpg bs=1 conv=notrunc seek=X count=Y
being X the offset in the file you want to erase and Y the number of bytes.
About the corruption, it depends on the type of file, the piece you are losing and the program you are using to read it.
For instance, JPG files use a variable bit-length encoding, meaning that just losing one bit may corrupt all the file from that point on. But just for that, there can be resyncronization points where the bitstream is reset, so from that point on, the file will look ok. But those resync points are optional when writing the file, and not every reader honor them in case of corruption...
And anyway, losing part of the headers will make the file totally unreadable.

Related

looking for fast way to edit a large file in linux

I have a large file, several gig of binary data, with an ASCII header at the top. I need to make a few small changes to the ASCII header. sed does the job, but it takes a fair bit of time since the file is so large. vi/vim is slow too. Is there any linux utility that can just go into the file, make the change at the top, and then get out quickly?
An example might be a header that looks like:
Code Rev: 3.5
Platform: platform1
Run Date: 12/13/16
Data source: whatever
Restart: False
followed by a large amount of binary data ....
and then I might need to, for example, edit an error in "Data source".
Provided that you know that your header is less than X bytes, you can use dd.
(!) But it only works if both strings have the same length (!)
Lets say, that the header is less that 4096 bytes
dd if=/path/to/file bs=4096 count=1 | sed 's/XXX/YYY/' | dd of=/path/to/file conv=notrunc
You can also do it programmatically, using languages like C,Python,PHP,JAVA etc. The idea is to open the file, read the header, fix it, and write it back.

Changing the head of a large Fortran binary file without dealing with the whole body

I have a large binary file (~ GB size) generated from a Fortran 90 program. I want to modify something in the head part of the file. The structure of the file is very complicated and contains many different variables, which I want to avoid going into. After reading and re-writing the head, is it possible to "copy and paste" the reminder of the file without knowing its detailed structure? Or even better, can I avoid re-writing the whole file altogether and just make changes on the original file? (Not sure if it matters, but the length of the header will be changed.)
Since you are changing the length of the header, I think that you have to write a new, revised file. You could avoid having to "understand" the records after the header by opening the file with stream access and just reading bytes (or perhaps four byte words if the file is a multiple of four bytes) until you reach EOF and copying them to the new file. But if the file was originally created as sequential access and you want to access it that way in the future, you will have to handle the record length information for the header record(s), including altering the value(s) to be consistent with the changed the length of the record(s). This record length information is typically a four-byte integer at beginning and end of each record, but it depends on the compiler.

CIFS/SMB Write Optimization

I am looking at making a write optimization for CIFS/SMB such that the writing of duplicate blocks are suppressed. For example, I read a file from the remote share and modify a portion near the end of the file. When I save the file, I only want to send write requests back to the remote side for the portions of the file that have actually changed. So basically, suppress all writes up until the point at which a non duplicate write is encountered. At that point the suppression will be disabled and the writes will be allowed as usual. The problem is I can't find any documentation MS-SMB/MS-SMB2/MS-CIFS or otherwise that indicates whether or not this is a valid thing to do. Does anyone know if this would be valid?
Dig deep into the sources of the Linux kernel, there is documentation on CIFS - both in source and text. E.g. http://www.mjmwired.net/kernel/Documentation/filesystems/cifs.txt
If you want to study the behaviour of e.g. the CIFS protocol, you may be able to test it with the unix command "dd". Mount any remote file-system via CIFS, e.g. into /media/remote. Change into this folder cd /media/remote Now create a file with some random stuff (e.g. from the kernel's random pool): dd if=/dev/urandom of=test.bin bs=4M count=5 In this example, you should see some 20MB of traffic. Then create another smaller file, somewhere on your machine, say, your home-folder: dd if=/dev/urandom of=~/test_chunk.bin bs=4M count=1 The interesting thing is what happens, if you attempt to write the chunk into a specific position of the remote test file: dd if=~/test_chunk.bin of=test.bin bs=4M count=1 seek=3 conv=notrunc Actually, this should only change block #4 out of 5 in the target file.
I guess you can adjust the block size ... I did this with 4 MB blocks. But it should help to understand what happens on the network.
The CIFS protocol does allow applications to write back specific portions of the file. This is controlled by the parameters DataOffset and DataLength in the SMB WriteAndX packet.
Documentation for the same can be found here:
http://msdn.microsoft.com/en-us/library/ee441954.aspx
The client can use these fields to write a specific length of data to specific offsets within the file.
Similar support exists in more recent versions of the protocol as well ...
SMB protocol have such write optimization. It works with append cifs operation. Where protocol read EOF for file and start writing new data with offset set to EOF value and length as append data bytes.

potential dangers of merging two files of unknown size?

I have a binary file that I need to insert a header at the beginning of. I was thinking of opening a new file, writing the header data, and then copying the data from the binary file to this new file. Since the binary file is about 1 megabyte, are there any dangers to making this file using fwrite? One specific concern would be something like unintentionally overwriting data, similar to what happens if using gets and the input is longer than the buffer.
There's no risk. Allocate a buffer of a given size, read that many bytes into it from the source file, write the buffer back out to the destination file. The operations (file read / file write) all take a maximum number of bytes so as long as your buffer is the size you claim it is, it won't be overrun.
Also, the approach you describe is pretty much the only way to do it. I've never heard of a filesystem that has an "insert these bytes at the beginning of this file" operation.

Doing file operations with 64-bit addresses in C + MinGW32

I'm trying to read in a 24 GB XML file in C, but it won't work. I'm printing out the current position using ftell() as I read it in, but once it gets to a big enough number, it goes back to a small number and starts over, never even getting 20% through the file. I assume this is a problem with the range of the variable that's used to store the position (long), which can go up to about 4,000,000,000 according to http://msdn.microsoft.com/en-us/library/s3f49ktz(VS.80).aspx, while my file is 25,000,000,000 bytes in size. A long long should work, but how would I change what my compiler(Cygwin/mingw32) uses or get it to have fopen64?
The ftell() function typically returns an unsigned long, which only goes up to 232 bytes (4 GB) on 32-bit systems. So you can't get the file offset for a 24 GB file to fit into a 32-bit long.
You may have the ftell64() function available, or the standard fgetpos() function may return a larger offset to you.
You might try using the OS provided file functions CreateFile and ReadFile. According to the File Pointers topic, the position is stored as a 64bit value.
Unless you can use a 64-bit method as suggested by Loadmaster, I think you will have to break the file up.
This resource seems to suggest it is possible using _telli64(). I can't test this though, as I don't use mingw.
I don't know of any way to do this in one file, a bit of a hack but if splitting the file up properly isn't a real option, you could write a few functions that temp split the file, one that uses ftell() to move through the file and swaps ftell() to a new file when its reaching the split point, then another that stitches the files back together before exiting. An absolutely botched up approach, but if no better solution comes to light it could be a way to get the job done.
I found the answer. Instead of using fopen, fseek, fread, fwrite... I'm using _open, lseeki64, read, write. And I am able to write and seek in > 4GB files.
Edit: It seems the latter functions are about 6x slower than the former ones. I'll give the bounty anyone who can explain that.
Edit: Oh, I learned here that read() and friends are unbuffered. What is the difference between read() and fread()?
Even if the ftell() in the Microsoft C library returns a 32-bit value and thus obviously will return bogus values once you reach 2 GB, just reading the file should still work fine. Or do you need to seek around in the file, too? For that you need _ftelli64() and _fseeki64().
Note that unlike some Unix systems, you don't need any special flag when opening the file to indicate that it is in some "64-bit mode". The underlying Win32 API handles large files just fine.

Resources