CIFS/SMB Write Optimization

I am looking at making a write optimization for CIFS/SMB such that the writing of duplicate blocks is suppressed. For example, I read a file from the remote share and modify a portion near the end of the file. When I save the file, I only want to send write requests back to the remote side for the portions of the file that have actually changed. So basically, suppress all writes up to the point at which a non-duplicate write is encountered. At that point the suppression is disabled and writes are allowed as usual. The problem is that I can't find any documentation, MS-SMB/MS-SMB2/MS-CIFS or otherwise, that indicates whether or not this is a valid thing to do. Does anyone know if this would be valid?

Dig into the Linux kernel sources; they contain documentation on CIFS, both in the code and as text files, e.g. http://www.mjmwired.net/kernel/Documentation/filesystems/cifs.txt
If you want to study how the CIFS protocol behaves, you can test it with the Unix command "dd". Mount any remote file system via CIFS, e.g. at /media/remote, and change into that folder:
cd /media/remote
Now create a file with some random content (e.g. from the kernel's random pool):
dd if=/dev/urandom of=test.bin bs=4M count=5
In this example, you should see some 20 MB of traffic. Then create another, smaller file somewhere on your local machine, say in your home folder:
dd if=/dev/urandom of="$HOME/test_chunk.bin" bs=4M count=1
The interesting thing is what happens when you write the chunk to a specific position in the remote test file:
dd if="$HOME/test_chunk.bin" of=test.bin bs=4M count=1 seek=3 conv=notrunc
This should change only block #4 out of 5 in the target file.
I did this with 4 MB blocks, but you can adjust the block size. Either way, it should help to understand what happens on the network.
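The comparison the question describes (skip everything up to the first changed block) can be rehearsed locally before looking at the wire. What follows is only a sketch: first_changed_offset is a hypothetical helper, the 4096-byte block size is arbitrary, and it says nothing about what the CIFS client itself does.

```shell
#!/bin/sh
# Sketch: find the start of the first block that differs between an original
# file and its modified copy. first_changed_offset is a hypothetical helper.
first_changed_offset() {
    old=$1; new=$2; bs=$3
    # cmp -l lists every differing byte as "<1-based position> <old> <new>";
    # only the first position is needed.
    pos=$(cmp -l "$old" "$new" 2>/dev/null | head -n 1 | awk '{print $1}')
    [ -n "$pos" ] || return 1            # no differing byte found
    echo $(( (pos - 1) / bs * bs ))      # round down to a block boundary
}

workdir=$(mktemp -d)
bs=4096
# "original": five zero-filled blocks; "modified": same, with block #4 replaced
dd if=/dev/zero of="$workdir/old.bin" bs=$bs count=5 2>/dev/null
cp "$workdir/old.bin" "$workdir/new.bin"
head -c $bs /dev/zero | tr '\0' 'x' |
    dd of="$workdir/new.bin" bs=$bs seek=3 conv=notrunc 2>/dev/null

offset=$(first_changed_offset "$workdir/old.bin" "$workdir/new.bin" "$bs")
echo "writes can be suppressed up to byte offset $offset"
```

Everything before the printed offset is identical in both copies, so under the question's scheme only the data from that offset onward would have to be written back.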

The CIFS protocol does allow applications to write back specific portions of the file. In the SMB WriteAndX (SMB_COM_WRITE_ANDX) request this is controlled by the Offset and DataLength fields; DataOffset only says where the data starts inside the packet.
Documentation for the same can be found here:
http://msdn.microsoft.com/en-us/library/ee441954.aspx
The client can use these fields to write a specific length of data to specific offsets within the file.
Similar support exists in more recent versions of the protocol as well ...

The SMB protocol already has a write optimization of this kind for append operations: the client reads the EOF position of the file and then writes the new data with the offset set to the EOF value and the length set to the number of appended bytes.

Related

Is it possible to resize MTD partitions at runtime?

I have a very specific need:
to partially replace the content of a flash and to move MTD partition boundaries.
Current map is:
u-boot 0x000000 0x040000
u-boot-env 0x040000 0x010000
kernel 0x050000 0x230000
initrd 0x280000 0x170000
scripts 0x3f0000 0x010000
filesystem 0x400000 0xbf0000
firmware 0xff0000 0x010000
While desired output is:
u-boot 0x000000 0x040000
u-boot-env 0x040000 0x010000
kernel 0x050000 0x230000
filesystem 0x280000 0xd70000
firmware 0xff0000 0x010000
This means to collapse initrd, scripts and filesystem into a single area while leaving the others alone.
The problem is that this has to be achieved from the running system (booted with the "old" configuration), and I have to rewrite the kernel and the "new" filesystem before rebooting.
The system is embedded, so I have little room for maneuver (I do have an SD card, though).
Of course, the rewritten kernel will have the "new" configuration written in its DTB.
The problem is the transition.
Note: I have seen this Question, but it is very old and has the drawback of needing kernel patches, which I would like to avoid.
NOTE2: this question has been flagged for deletion because "not about programming". I beg to disagree: I need to perform said operation on ~14k devices, most of them already sold to customers, so any workable solution should involve, at the very least, scripting.
NOTE3: if absolutely necessary I can even consider (small) kernel modifications (YES, I have means to update kernel remotely).
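As a sanity check on the two maps above, the arithmetic can be verified in the shell. This is just a sketch; it assumes a shell whose arithmetic evaluation accepts hex (bash, dash and busybox ash should), and the variable names are made up for illustration.

```shell
#!/bin/sh
# Offsets and sizes copied from the "old" and "desired" maps in the question.
initrd_off=0x280000;     initrd_size=0x170000
scripts_off=0x3f0000;    scripts_size=0x010000
filesystem_off=0x400000; filesystem_size=0xbf0000
firmware_off=0xff0000
new_fs_off=0x280000;     new_fs_size=0xd70000

# The three old partitions must be back to back and end where firmware starts.
contiguous=yes
[ $(( initrd_off + initrd_size )) -eq $(( scripts_off )) ] || contiguous=no
[ $(( scripts_off + scripts_size )) -eq $(( filesystem_off )) ] || contiguous=no
[ $(( filesystem_off + filesystem_size )) -eq $(( firmware_off )) ] || contiguous=no

# Their combined size should equal the size of the desired filesystem partition.
merged_size=$(printf '0x%06x' $(( initrd_size + scripts_size + filesystem_size )))
echo "old partitions contiguous: $contiguous"
echo "merged size: $merged_size (desired: $new_fs_size at $new_fs_off)"
```

The three old areas are adjacent and add up to 0xd70000, so the desired filesystem partition covers exactly the same flash range as initrd, scripts and filesystem together.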

I will leave the Accepted answer as-is, but, for anyone who happens to come here to find a solution, I want to point out that:
Recent (<4 years old) mtd-utils, coupled with a 4.0+ kernel, support:
Definition of a "master" device (an MTD device representing the full, unpartitioned flash). This is a kernel option.
mtd-utils has a specific mtdpart utility that can add/delete MTD partitions dynamically. NOTE: this utility works if (and only if) the master device above is enabled in the kernel.
With this utility it's possible to create multiple, possibly overlapping partitions; use with care!

I have three ideas/suggestions:
Instead of moving the partitions, can you just split the "new" filesystem image into chunks and write them to the corresponding "old" MTD partitions? This way you don't really need to change MTD partition map. After booting into the new kernel, it will see the new contiguous root filesystem. For JFFS2 filesystem, it should be fairly straightforward to do using split or dd, flash_erase and nandwrite. Something like:
# WARNING: this script assumes that it runs from tmpfs and the old root filesystem is already unmounted.
# Also, it assumes that your shell has arithmetic evaluation, which handles hex (my busybox 1.29 ash does this).
# assuming newrootfs.img is the image of new rootfs
new_rootfs_img="newrootfs.img"
mtd_initrd="/dev/mtd3"
mtd_initrd_size=0x170000
mtd_scripts="/dev/mtd4"
mtd_scripts_size=0x010000
mtd_filesystem="/dev/mtd5"
mtd_filesystem_size=0xbf0000
# prepare chunks of new filesystem image
bs="0x1000"
# note: using arithmetic evaluation $(()) to convert from hex and do the math.
# dd doesn't handle hex numbers ("dd: invalid number '0x1000'") -- $(()) works around this
dd if="${new_rootfs_img}" of="rootfs_initrd" bs=$(( bs )) count=$(( mtd_initrd_size / bs ))
dd if="${new_rootfs_img}" of="rootfs_scripts" bs=$(( bs )) count=$(( mtd_scripts_size / bs )) skip=$(( mtd_initrd_size / bs ))
dd if="${new_rootfs_img}" of="rootfs_filesystem" bs=$(( bs )) count=$(( mtd_filesystem_size / bs )) skip=$(( ( mtd_initrd_size + mtd_scripts_size ) / bs ))
# there's no going back after this point
flash_eraseall -j "${mtd_initrd}"
flash_eraseall -j "${mtd_scripts}"
flash_eraseall -j "${mtd_filesystem}"
nandwrite -p "${mtd_initrd}" rootfs_initrd
nandwrite -p "${mtd_scripts}" rootfs_scripts
nandwrite -p "${mtd_filesystem}" rootfs_filesystem
# don't forget to update the kernel too
There is kernel support for concatenating MTD devices (which is exactly what you're trying to do). I don't see an easy way to use it, but you could create a kernel module, which concatenates the desired partitions for you into a contiguous MTD device.
In order to combine the 3 MTD partitions into one to write the new filesystem, you could create a dm-linear mapping over the 3 mtdblocks, and then turn it back into an MTD device using block2mtd. (i.e. mtdblock + device mapper linear + block2mtd) But it looks very awkward and I don't know if it'll work well (for say, OOB data).
EDIT1: added a comment explaining use of $(( bs )) -- to convert from hex as dd doesn't handle hex numbers directly (neither coreutils, nor busybox dd).

AFAIK, @andrey's suggestion 1 is wrong.
an mtd partition is made of a sequence of blocks, any of which could be bad or go bad anytime. this is why the simple mtd char abstraction exists: an mtd char device (not the mtdblock one) is read sequentially and skips bad blocks. nandwrite also writes sequentially and skips bad blocks.
an mtd char device sort of acts like:
a single file into which you cannot random access, from which you can only read sequentially from the beginning to the end (or to where you get bored).
a single file into which you cannot random access, to which you can only write sequentially from the beginning (or from an erase block where you previously stopped reading) all the way to the end. (that is, you can truncate and append, but you cannot write mid-file.) to write you need to previously erase all erase blocks from where you start writing to the end of the partition.
this means that the partition size is the maximum theoretical capacity, but typically the capacity will be less due to bad blocks, and can be effectively reduced every time you rewrite the partition. you can never expect to write the full size of an mtd partition.
this is where @andrey's suggestion 1 goes wrong: it breaks up the file to be written into max-sized pieces before writing each piece. but you never know beforehand how much data will fit into an mtd partition without actually writing that data.
instead, you typically need to write some data, and you pray there will be enough good blocks to fit it. if at some point there are not, the write fails and the device reached end-of-life. needless to say, the larger the fraction of a partition you need, the higher the likelihood that the write will fail (and when that happens, it typically means that the device is toast).
to actually implement something akin to suggestion 1, you need to start writing into a partition (skipping bad blocks), and when you run out of erase blocks, you continue writing into the next partition, and so on. the point being: you cannot know where the data boundaries will lie until you actually write the data and fill each partition; there is no other way.
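to illustrate the bookkeeping only (no flash is touched), here is a small sketch. plan_spill is a hypothetical helper, the 64 KiB erase block size is an assumption, and the bad-block counts are invented; on a real device they are only discovered while writing.

```shell
#!/bin/sh
# plan_spill IMAGE_BLOCKS NAME:TOTAL_BLOCKS:BAD_BLOCKS...
# Prints how many erase blocks of the image end up in each partition when
# writing spills from one partition into the next. Hypothetical helper.
plan_spill() {
    remaining=$1; shift
    for spec in "$@"; do
        name=${spec%%:*}; rest=${spec#*:}
        total=${rest%%:*}; bad=${rest#*:}
        good=$(( total - bad ))          # usable blocks in this partition
        if [ "$remaining" -lt "$good" ]; then take=$remaining; else take=$good; fi
        echo "$name $take"
        remaining=$(( remaining - take ))
    done
    [ "$remaining" -eq 0 ]               # non-zero status: the image did not fit
}

# Partition sizes from the question in 64 KiB erase blocks (block size assumed):
# initrd 0x170000 -> 23, scripts 0x010000 -> 1, filesystem 0xbf0000 -> 191.
# The bad-block counts (2, 0, 3) are made up for the example.
plan=$(plan_spill 200 initrd:23:2 scripts:1:0 filesystem:191:3)
echo "$plan"
```

with two bad blocks in the first partition, the first boundary in the image falls after 21 blocks instead of 23, which is exactly what a split done in advance cannot anticipate.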

How does the bittorrent assemble the missing pieces?

I use BitTorrent and sometimes run into torrents that cannot be completed because no seed has the missing pieces.
In that case I sometimes stop the transfer and try to open the incomplete file (for example, an image file).
If I am lucky, I can still view the downloaded image even though some parts are missing.
I would like to reproduce this situation artificially, and here is what I tried:
1) split a BMP image file of about 1 megabyte into 16-kilobyte chunks with the Linux split command,
2) truncate just one of the chunks to 0 bytes,
3) then rejoin all the chunks with the cat command.
However, unlike the torrent "missing pieces" situation, the file becomes completely corrupt and cannot be read.
In theory this does not seem to be anything special, so what is wrong? And how can I achieve what I want?
I would appreciate your help.

Use dd:
dd if=/dev/zero of=image.jpg bs=1 conv=notrunc seek=X count=Y
where X is the offset in the file at which to start erasing and Y is the number of bytes to zero.
As for the corruption, it depends on the type of file, which piece you lose, and the program you use to read it.
For instance, JPG files use a variable bit-length encoding, so losing even one bit can corrupt the whole file from that point on. To counter this, the format allows resynchronization points where the bitstream is reset, so the file looks fine again from that point on. But those resync points are optional when writing the file, and not every reader honors them in case of corruption...
And in any case, losing part of the headers makes the file totally unreadable.
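The likely difference from the split/cat attempt is that emptying one chunk makes the rejoined file 16 KB shorter and shifts everything after the gap, whereas a torrent client normally keeps every piece at its original offset and leaves the missing ones as zeros. A minimal sketch of the in-place variant, using a stand-in file instead of a real image (the piece index and file names are arbitrary):

```shell
#!/bin/sh
workdir=$(mktemp -d)
img="$workdir/image.bin"
piece_size=16384      # 16 KiB pieces, as in the question
pieces=64             # 64 pieces = 1 MiB stand-in for the image
missing=10            # zero-based index of the piece to "lose" (arbitrary)

# Fill the stand-in file with the letter A so zeroed bytes are easy to spot.
head -c $(( pieces * piece_size )) /dev/zero | tr '\0' 'A' > "$img"
size_before=$(( $(wc -c < "$img") ))

# Overwrite exactly one piece with zeros; conv=notrunc keeps the rest of the file.
dd if=/dev/zero of="$img" bs=$piece_size seek=$missing count=1 conv=notrunc 2>/dev/null

size_after=$(( $(wc -c < "$img") ))
intact_bytes=$(( $(tr -cd 'A' < "$img" | wc -c) ))
echo "size before=$size_before after=$size_after, intact bytes=$intact_bytes"
```

The file size is unchanged and only the chosen piece is zeroed, so all other data stays at its original offset. Using bs equal to the piece size instead of bs=1 is just faster; the effect is the same.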

(Linux) Read multiple files as a single one without having to copy the chunks to a new file first

(Linux) The problem at hand is the following:
Let's suppose we have foo_1 and foo_2, which are in fact 2 chunks of a file foo, such that this command would rebuild it:
cat foo_1 foo_2 >foo
I would like the system to treat {foo_1 + foo_2} as a single file foo without having to copy the chunks first with the command above.
Depending on the command you use to read {foo_1 + foo_2}, say you want an md5sum, you can just use named pipes, which provide this feature.
You would do:
mkfifo my_named_pipe
cat foo_1 foo_2 >my_named_pipe &
md5sum my_named_pipe
That works!
But named pipes have a big limitation: all accesses must be sequential (no seek), since it is basically a pipe.
Hence this "named pipes" method is not a "generic read multiple files as a virtual single file".
Indeed, it works in the md5sum example above because md5sum only needs to read the file sequentially.
Now if that file were, say, a rar archive, a video you would like to play with VLC, or an ISO you would like to mount and access randomly, it would fail, since those programs need non-sequential reads.
Question:
So, before calling the cavalry, I mean writing a FUSE filesystem myself that does what I described above to save precious I/O and space, I would like to know whether you have heard of a generic method to do so.
What I am thinking of is something looking like:
fuseVirtualFile mountpoint foo foo_1 foo_2
That would show the "virtual file" foo under mountpoint, so mountpoint/foo
This "virtual file" would be the read-only concatenation of foo_1 and foo_2, without having to actually do the write I/O which saves time, disk space, and wear on the SSD!

So, since it apparently didn't exist, I just created it!
Behold: mfs
This is a fuse filesystem that will do as asked by my question, which is "virtually merge several files into a single one".
Then it becomes possible to access (read-only) the merged file as if it had actually been merged into a single file by a cat command.
As already stated in the question, this is only useful if you need random read access, since stream access can be done via named pipes.
Here it is: https://github.com/Bylon/mfs
Enjoy!

Comparing a big file on two servers

I have two servers and I want to move a backup tar.bz file (50 GB) from one to the other.
I used axel to download the file from the source server. But now when I try to extract it, I get an "unexpected EOF" error. Both files have the same size, so the problem seems to be in the content.
Is there a program/app/script that can compare these two files and correct only the damaged parts? Or do I need to split the file by hand and compare each part's hash?
The problem is that the source server has limited bandwidth and a low transfer speed, so I can't transfer the file again from scratch.

You can use a checksum utility, such as md5 or sha, to see if the files are the same on either end. e.g.
$ md5 somefile
MD5 (somefile) = d41d8cd98f00b204e9800998ecf8427e
By running such a command on both ends and comparing the results, you can be reasonably certain whether the files are the same.
As for only downloading the erroneous portion of a file, this would require checksums on both sides for "pieces" of the data, such as with the bittorrent protocol.

OK, I found "rdiff" to be the best way to solve this problem. First, on the destination server:
rdiff signature destFile.tar.bz destFile.sig
Then transfer destFile.sig to the source server and run rdiff there:
rdiff delta destFile.sig srcFile.tar.bz delta.rdiff
Finally, transfer delta.rdiff to the destination server and run rdiff once more there:
rdiff patch destFile.tar.bz delta.rdiff fixedFile.tar.bz

This process really doesn't need a separate program; a couple of simple commands will do. Split the file into chunks on both servers and take the md5sum of each chunk:
split -b 1000000000 -d bigfile bigfile.
for i in bigfile.*
do
    md5sum "$i"
done
If any of the md5sums don't match, copy over the mismatched chunk(s) and concatenate the chunks back together. To make comparing the md5sums easier, run a diff between the two outputs (or take an md5sum of each output to see whether there is any difference at all without having to copy the output over).
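If there is no room for a second 50 GB copy in split form, the chunks can also be hashed in place with dd. The following is a sketch under assumptions: chunk_sums is a hypothetical helper, both files sit in a local scratch directory for the demo (in the real case each list would be produced on its own server), and the chunk size is tiny so the example runs quickly.

```shell
#!/bin/sh
# chunk_sums FILE CHUNK_BYTES: print "<chunk index> <md5>" for every chunk of
# FILE without writing the chunks to disk. Hypothetical helper.
chunk_sums() {
    file=$1; chunk=$2
    size=$(( $(wc -c < "$file") ))
    i=0
    while [ $(( i * chunk )) -lt "$size" ]; do
        sum=$(dd if="$file" bs="$chunk" skip="$i" count=1 2>/dev/null | md5sum | cut -d' ' -f1)
        echo "$i $sum"
        i=$(( i + 1 ))
    done
}

workdir=$(mktemp -d)
chunk=4096
# Stand-ins for the source and destination copies: 20000 bytes of "a" ...
head -c 20000 /dev/zero | tr '\0' 'a' > "$workdir/src.bin"
cp "$workdir/src.bin" "$workdir/dst.bin"
# ... with one corrupted byte inside chunk 2 of the destination copy.
printf 'X' | dd of="$workdir/dst.bin" bs=1 seek=$(( 2 * chunk + 100 )) conv=notrunc 2>/dev/null

chunk_sums "$workdir/src.bin" "$chunk" > "$workdir/src.sums"
chunk_sums "$workdir/dst.bin" "$chunk" > "$workdir/dst.sums"
bad_chunks=$(diff "$workdir/src.sums" "$workdir/dst.sums" | awk '/^</ { print $2 }')
echo "mismatched chunks: $bad_chunks"

# Repair: copy only the mismatched chunks back into place at the same offset.
for i in $bad_chunks; do
    dd if="$workdir/src.bin" of="$workdir/dst.bin" bs="$chunk" skip="$i" seek="$i" count=1 conv=notrunc 2>/dev/null
done
```

Between two servers, the repair step would fetch each bad chunk from the source (for example by running the dd read side over ssh) and write it with the same seek and conv=notrunc on the destination.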

Upload output of a program directly to a remote file by ftp

I have a program that generates a lot of data, specifically encrypted tarballs. I want to upload the result to a remote FTP server.
The files are quite big (about 60 GB), so I don't want to waste disk space on a temporary directory, or the time it takes to write there.
Is this possible? I checked the ncftpput utility, but I found no option to read from standard input.

curl can upload while reading from stdin:
-T, --upload-file
[...]
Use the file name "-" (a single dash) to use stdin instead of a given
file. Alternately, the file name "." (a single period) may be
specified instead of "-" to use stdin in non-blocking mode to allow
reading server output while stdin is being uploaded.
[...]

I guess you could do this with any upload program using a named pipe, but I foresee problems if some part of the upload goes wrong and you have to restart it: the data is gone and you cannot resume the upload, even if you lost only 1 byte. This also applies to a read-from-stdin strategy.
My strategy would be the following:
1. Create a named pipe using mkfifo.
2. Start the encryption process in the background, writing to that named pipe. Soon the pipe buffer will be full and the encryption process will block trying to write to the pipe. It will unblock when we read data from the pipe later.
3. Read a certain amount of data from the named pipe (let's say 1 GB) and put it in a file. The utility dd could be used for that.
4. Upload that file through FTP the standard way. You can then deal with retries and network errors. Once the upload is complete, delete the file.
5. Go back to step 3 until you get an EOF from the pipe. This means that the encryption process is done writing to the pipe.
6. On the server, append the files in order to an empty file, deleting each file once it has been appended. Using touch next_file; for f in ordered_list_of_files; do cat "$f" >> next_file; rm "$f"; done or some variant should do it.
You can of course prepare the next file while you upload the previous one to maximize concurrency. The bottleneck will be either your encryption algorithm (CPU), your network bandwidth, or your disk bandwidth.
This method will cost you 2 GB of disk space on the client side (more or less, depending on the size of the files) and 1 GB of disk space on the server side. But you can be sure that you will not have to start over if your upload hangs near the end.
If you want to be doubly sure about the result of the transfer, you could compute a hash of your files while writing them to disk on the client side, and only delete the client file once you have verified the hash on the server side. The hash can be computed on the client side while you are writing the file to disk using dd ... | tee local_file | sha1sum. On the server side, you would have to compute the hash before doing the cat, and skip the cat if the hash is not good, so I cannot see how to do it without reading the file twice (once for the hash, and once for the cat).
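The chunking loop of this strategy can be sketched as follows. It is a demo under assumptions, not a finished uploader: the producer is a plain cat of random data standing in for the encryption process, the chunk size is 1 MiB rather than 1 GB, the upload itself is only a comment, and iflag=fullblock is a GNU dd option (recent busybox dd has it too, as far as I know).

```shell
#!/bin/sh
workdir=$(mktemp -d)
fifo="$workdir/stream.fifo"
block=65536; blocks_per_chunk=16       # 16 x 64 KiB = 1 MiB chunks for the demo

head -c 2500000 /dev/urandom > "$workdir/payload.bin"   # stand-in for the encrypted data
mkfifo "$fifo"
cat "$workdir/payload.bin" > "$fifo" &                  # stand-in for the encryption process

exec 3< "$fifo"        # open the pipe once; closing it between chunks can make the writer fail
n=0
while :; do
    part=$(printf '%s/part.%04d' "$workdir" "$n")
    # iflag=fullblock makes dd keep reading until each block is full (pipes return short reads)
    dd bs=$block count=$blocks_per_chunk iflag=fullblock <&3 > "$part" 2>/dev/null
    if [ ! -s "$part" ]; then rm -f "$part"; break; fi  # empty chunk: EOF reached
    sha1sum "$part" > "$part.sha1"                      # hash to verify on the server later
    # upload "$part" here (with retries), then delete it
    n=$(( n + 1 ))
done
exec 3<&-
wait
echo "wrote $n chunks"
```

For chunks of about 1 GB, keeping a moderate block size and raising the count (for example bs=1M count=1024) avoids asking dd for a single 1 GB buffer.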

You can write to a remote file using ssh:
program | ssh -l userid host 'cd /some/remote/directory && cat - > filename'

Here is an example of uploading to an FTP site with curl:
wget -O- http://www.example.com/test.zip | curl -T - ftp://user:password@ftp.example.com:2021/upload/test.zip
