How to simulate an FTP transfer of a big file with curl? (Linux)

I want to test my FTP connection to an FTP server. To do that, I want to upload a big file (around 10 GB) to the server.
But that means I first have to generate a 10 GB file, which is a problem if my device does not have enough memory/disk space. So I want to simulate a transfer of a big file (say 10 GB) without actually generating a 10 GB file.
Is it possible to do this with the curl command?

'truncate' (the command-line tool) is your friend. It can create huge files that are "sparse", meaning that the long sequence of zeroes is just a hole that doesn't actually occupy space on the drive.
Therefore, you can create a 10 GB file (containing all zeroes) even on a much smaller file system and hard drive, like this (works on Linux):
$ truncate --size=10G bigfile
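A minimal upload sketch for that sparse file might then look like this (the user, password, host and path are placeholders):

# upload the sparse file over FTP; credentials, host and path are placeholders
curl -T bigfile ftp://user:password@ftp.example.com/upload/bigfile

The file is sparse on disk, but curl still sends the full 10 GB of zeroes over the wire, which is exactly what a connection test needs.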

With FTP, assuming the client and server have a small round-trip delay, sending 10 GB takes about the same amount of time as sending 100 files of 100 MB over the same open connection. Hopefully your device can spare that much space.
The key is to avoid re-opening the connection. Do something like the following to keep the connection open:
X=ftp://server/path/to/100M
X100=$(for i in {1..100} ; do echo " " $X ; done )
curl $X100

Related

How to copy multiple files simultaneously using scp

I would like to copy multiple files simultaneously to speed up my process. I currently use the following:
scp -r root@xxx.xxx.xx.xx:/var/www/example/example.example.com .
but it only copies one file at a time. I have a 100 Mbps fibre connection, so I have the bandwidth available to copy a lot at the same time. Please help.
You can use background tasks together with the wait command.
The wait command ensures that all the background tasks are completed before the next line is processed, i.e. the echo will only be executed after the scp to all three nodes has completed.
#!/bin/bash
scp -i anuruddha.pem myfile1.tar centos@192.168.30.79:/tmp &
scp -i anuruddha.pem myfile2.tar centos@192.168.30.80:/tmp &
scp -i anuruddha.pem myfile.tar centos@192.168.30.81:/tmp &
wait
echo "SCP completed"
SSH can do so-called "multiplexing" - several sessions over one connection (to the same server). That can be one way to achieve what you want; look up keywords like "ControlMaster".
The second way is to use more connections and send every job to the background:
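As a rough sketch, multiplexing can be enabled in ~/.ssh/config along these lines (the host name is a placeholder):

# ~/.ssh/config
Host example-server
    ControlMaster auto                 # reuse an existing connection when possible
    ControlPath ~/.ssh/cm-%r@%h:%p     # socket used to share the connection
    ControlPersist 10m                 # keep the master connection open for 10 minutes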
for file in file1 file2 file3 ; do
scp $file server:/tmp/ &
done
But this is the answer to your question - "How to copy multiple files simultaneously". To speed things up, you can use a weaker cipher (rc4 etc.), and don't forget that the bottleneck can be your hard drive, because scp doesn't implicitly limit the transfer speed.
The last thing is to use rsync - in some cases it can be a lot faster than scp...
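For example, a sketch of the rsync equivalent of the original command (host and path taken from the question above):

# -a recurses and preserves attributes, -z compresses in transit,
# --partial keeps partially transferred files so transfers can be resumed
rsync -az --partial root@xxx.xxx.xx.xx:/var/www/example/example.example.com .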
I am not sure if this helps you, but I generally archive the files at the source (compression is not required; just archiving is sufficient), download the archive, and extract it. This speeds up the process significantly.
Before archiving it took > 8 hours to download 1GB
After archiving it took < 8 minutes to do the same
You can use parallel-scp (AKA pscp): http://manpages.ubuntu.com/manpages/natty/man1/parallel-scp.1.html
With this tool, you can copy a file (or files) to multiple hosts simultaneously.
100 Mbit Ethernet is pretty slow, actually. You can expect about 12 MiB/s in theory. In practice, you usually get between 4-6 MiB/s at best.
That said, you won't see a speed increase if you run multiple sessions in parallel. You can try it yourself: simply run two parallel SCP sessions copying two large files. My guess is that you won't see a noticeable speedup. The reasons for this are:
The slowest component on the network path between the two computers determines the maximum speed.
Other people might be accessing example.com at the same time, reducing the bandwidth it can give you.
100 Mbit Ethernet requires pretty big gaps between two consecutive network packets. Gigabit Ethernet is much better in this regard.
Solutions:
Compress the data before sending it over the wire
Use a tool like rsync (which uses SSH under the hood) to copy only the files which have changed since the last time you ran the command.
Copying a lot of small files takes a lot of time. Try to create an archive of all the files on the remote side and send that as a single archive.
The last suggestion can be done like this:
ssh root@xxx "cd /var/www/example ; tar cf - example.example.com" > example.com.tar
or with compression:
ssh root@xxx "cd /var/www/example ; tar czf - example.example.com" > example.com.tar.gz
Note: bzip2 compresses better but is slower. That's why I use gzip (z) for tasks like this.
If you specify multiple files, scp will download them sequentially:
scp -r root@xxx.xxx.xx.xx:/var/www/example/file1 root@xxx.xxx.xx.xx:/var/www/example/file2 .
Alternatively, if you want the files to be downloaded in parallel, then use multiple invocations of scp, putting each in the background.
#! /usr/bin/env bash
scp root@xxx.xxx.xx.xx:/var/www/example/file1 . &
scp root@xxx.xxx.xx.xx:/var/www/example/file2 . &
wait    # wait for both background transfers to finish

Comparing a big file on two servers

I have two servers and I want to move a 50 GB backup tar.bz file from one to the other.
I used AXEL to download the file from the source server. But now, when I want to extract it, I get an "unexpected EOF" error. The sizes of the two files are the same, so it seems there is a problem with the content.
I want to know if there is a program/app/script that can compare these two files and correct only the damaged parts, or do I need to split the file by hand and compare each part's hash?
The problem is that the source server has limited bandwidth and a low transfer speed, so I can't transfer the whole thing again from scratch.
You can use a checksum utility, such as md5 or sha, to see if the files are the same on either end. e.g.
$ md5 somefile
MD5 (somefile) = d41d8cd98f00b204e9800998ecf8427e
By running such a command on both ends and comparing the results, you can get some certainty as to whether the files are the same.
As for only downloading the erroneous portion of a file, this would require checksums on both sides for "pieces" of the data, such as with the bittorrent protocol.
OK, I found rdiff to be the best way to solve this problem. Just do the following:
On the destination server:
rdiff signature destFile.tar.bz destFile.sig
Then transfer destFile.sig to the source server and run rdiff there, on the source server:
rdiff delta destFile.sig srcFile.tar.bz delta.rdiff
Then transfer delta.rdiff to the destination server and run rdiff once again, on the destination server:
rdiff patch destFile.tar.bz delta.rdiff fixedFile.tar.bz
This process really doesn't need a separate program; you can do it with a couple of simple commands. If any of the md5sums don't match, copy over the mismatched piece(s) and concatenate the pieces back together. To make comparing the md5sums easier, just run a diff between the outputs from the two machines (or take an md5sum of each output to see whether there is any difference at all, without having to copy the full output over).
split -b 1000000000 -d bigfile bigfile.
for i in bigfile.*
do
md5sum $i
done
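Putting that together, a rough sketch (the chunk name bigfile.03, the user, host and paths are placeholders):

# on both servers: split into 1 GB pieces and checksum them
split -b 1000000000 -d bigfile bigfile.
md5sum bigfile.* > checksums.txt

# on the destination: fetch the source checksums and see which pieces differ
scp user@source:/path/checksums.txt checksums.src.txt
diff checksums.src.txt checksums.txt

# re-copy only the mismatched piece(s), then reassemble the file
scp user@source:/path/bigfile.03 .
cat bigfile.* > bigfile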

CIFS/SMB Write Optimization

I am looking at making a write optimization for CIFS/SMB such that the writing of duplicate blocks is suppressed. For example, I read a file from the remote share and modify a portion near the end of the file. When I save the file, I only want to send write requests back to the remote side for the portions of the file that have actually changed. So basically, suppress all writes up until the point at which a non-duplicate write is encountered; at that point the suppression is disabled and the writes are allowed as usual. The problem is that I can't find any documentation (MS-SMB/MS-SMB2/MS-CIFS or otherwise) that indicates whether or not this is a valid thing to do. Does anyone know if this would be valid?
Dig deep into the sources of the Linux kernel; there is documentation on CIFS, both in the source and as text, e.g. http://www.mjmwired.net/kernel/Documentation/filesystems/cifs.txt
If you want to study the behaviour of the CIFS protocol, you can test it with the Unix command dd. Mount any remote file system via CIFS, e.g. into /media/remote, and change into that folder:
cd /media/remote
Now create a file with some random stuff (e.g. from the kernel's random pool):
dd if=/dev/urandom of=test.bin bs=4M count=5
In this example, you should see some 20 MB of traffic. Then create another, smaller file somewhere on your machine, say in your home folder:
dd if=/dev/urandom of=~/test_chunk.bin bs=4M count=1
The interesting thing is what happens if you attempt to write the chunk into a specific position of the remote test file:
dd if=~/test_chunk.bin of=test.bin bs=4M count=1 seek=3 conv=notrunc
Actually, this should only change block #4 out of 5 in the target file.
I guess you can adjust the block size ... I did this with 4 MB blocks. But it should help to understand what happens on the network.
The CIFS protocol does allow applications to write back specific portions of the file. This is controlled by the parameters DataOffset and DataLength in the SMB WriteAndX packet.
Documentation for the same can be found here:
http://msdn.microsoft.com/en-us/library/ee441954.aspx
The client can use these fields to write a specific length of data to specific offsets within the file.
Similar support exists in more recent versions of the protocol as well ...
The SMB protocol does have such a write optimization. It is used for the CIFS append operation: the protocol reads the EOF of the file and starts writing the new data with the offset set to the EOF value and the length set to the number of appended bytes.

Upload output of a program directly to a remote file by ftp

I have a program that generates a lot of data, specifically encrypted tarballs. I want to upload the result to a remote FTP server.
The files are quite big (about 60 GB), so I don't want to waste HDD space on a temporary directory, or the time it takes to write one.
Is it possible? I checked the ncftpput utility, but there is no option to read from standard input.
curl can upload while reading from stdin:
-T, --upload-file
[...]
Use the file name "-" (a single dash) to use stdin instead of a given
file. Alternately, the file name "." (a single period) may be
specified instead of "-" to use stdin in non-blocking mode to allow
reading server output while stdin is being uploaded.
[...]
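For this use case, a minimal sketch might look like the following (the gpg encryption step, the recipient key, host and credentials are placeholders standing in for whatever encryption the asker uses):

# stream the tarball through gpg straight to the FTP server, no temporary file
tar cf - /path/to/data | gpg --encrypt --recipient backup-key \
  | curl -T - ftp://user:password@ftp.example.com/upload/backup.tar.gpg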
I guess you could do that with any upload program using a named pipe, but I foresee problems if some part of the upload goes wrong and you have to restart it: the data is gone and you cannot resume the upload, even if you only lost one byte. This also applies to a read-from-stdin strategy.
My strategy would be the following:
1. Create a named pipe using mkfifo.
2. Start the encryption process writing to that named pipe in the background. Soon, the pipe buffer will be full and the encryption process will block trying to write data to the pipe. It will unblock when we read data from the pipe later.
3. Read a certain amount of data from the named pipe (say 1 GB) and put it in a file. The utility dd can be used for that.
4. Upload that file through FTP in the standard way. You can then deal with retries and network errors. Once the upload is complete, delete the file.
5. Go back to step 3 until you get an EOF from the pipe. This means the encryption process is done writing to the pipe.
6. On the server, append the files in order to an empty file, deleting each one once it has been appended. Using touch next_file; for f in ordered_list_of_files; do cat $f >> next_file; rm $f; done or some variant should do it.
You can of course prepare the next file while you upload the previous one, to use concurrency at its maximum. The bottleneck will be either your encryption algorithm (CPU), your network bandwidth, or your disk bandwidth.
This method will waste 2 GB of disk space on the client side (or less, or more, depending on the chunk size) and 1 GB of disk space on the server side. But you can be sure that you will not have to start over if your upload hangs near the end.
If you want to be doubly sure about the result of the transfer, you could compute a hash of your files while writing them to disk on the client side, and only delete a client file once you have verified its hash on the server side. The hash can be computed on the client side at the same time you are writing the file to disk, using dd ... | tee local_file | sha1sum. On the server side, you would have to compute the hash before doing the cat, and avoid the cat if the hash is not good; I cannot see how to do that without reading the file twice (once for the hash, and once for the cat).
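Putting the client-side steps above together, a rough, untested sketch (encrypt_tarball, the chunk size and the FTP credentials are placeholders):

#!/usr/bin/env bash
# sketch of the chunked-upload strategy described above
mkfifo /tmp/encpipe
encrypt_tarball > /tmp/encpipe &     # hypothetical producer writing to the pipe
exec 3< /tmp/encpipe                 # hold one read end open for the whole run

i=0
while true; do
    chunk=$(printf 'chunk.%04d' "$i")
    # pull up to 1 GB from the pipe; iflag=fullblock avoids short reads from a pipe
    dd of="$chunk" bs=1M count=1024 iflag=fullblock <&3
    if [ ! -s "$chunk" ]; then       # empty chunk means the producer hit EOF
        rm -f "$chunk"
        break
    fi
    curl -T "$chunk" "ftp://user:password@ftp.example.com/upload/$chunk"
    rm -f "$chunk"
    i=$((i + 1))
done

exec 3<&-
rm /tmp/encpipe

Retry logic around the curl call is left out for brevity; that is where you would handle network errors per chunk.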
You can write to a remote file using ssh:
program | ssh -l userid host 'cd /some/remote/directory && cat - > filename'
Here is an example of uploading to an FTP site with curl:
wget -O- http://www.example.com/test.zip | curl -T - ftp://user:password@ftp.example.com:2021/upload/test.zip

Syncing two files when one is still being written to

I have an application (video stream capture) which constantly writes its data to a single file. The application typically runs for several hours, creating a ~1 gigabyte file. Soon (within a few seconds) after it quits, I'd like to have 2 copies of the file it was writing - let's say one in /mnt/disk1 and another in /mnt/disk2 (the latter is a USB flash drive with a FAT32 filesystem).
I don't really like the idea of modifying the application to write 2 copies simultaneously, so I thought of:
Application starts and begins to write the file (let's call it /mnt/disk1/file.mkv)
Some utility starts, copies what's already there in /mnt/disk1/file.mkv to /mnt/disk2/file.mkv
After getting to the initial synced state, it continues to follow the file being written, in a manner like tail -f does, copying everything it gets from /mnt/disk1/file.mkv to /mnt/disk2/file.mkv
Several hours pass
Application quits, we stop our syncing utility
Afterwards, we run a quick rsync /mnt/disk1/file.mkv /mnt/disk2/file.mkv just to make sure they're the same. If they are already the same, it should just run a quick check and quit fairly soon.
What is the best approach for syncing 2 files, preferably using simple Linux shell-available utilities? Maybe I could use some clever trick with FUSE / an md device / tee / tail -f?
Solution
The best possible solution for my case seems to be
mencoder ... -o >(
tee /mnt/disk1/file.mkv |
tee /mnt/disk2/file.mkv |
mplayer -
)
This one uses bash/zsh-specific magic called "process substitution", eliminating the need to make named pipes manually using mkfifo, and displays what's being encoded as a bonus :)
Hmmm... the file is not usable while it's being written anyway, so why don't you "trick" your program into writing through a pipe/fifo and use a second, very simple program to create the 2 copies?
This way, you have your two copies as soon as the original process ends.
Read the manual page on tee(1).
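A minimal sketch of that fifo + tee approach, assuming the capture program can be pointed at an arbitrary output path (capture_program and its --output flag are hypothetical):

mkfifo /tmp/capture.fifo
# tee writes one copy to disk1 and, via stdout redirection, a second copy to disk2
tee /mnt/disk1/file.mkv > /mnt/disk2/file.mkv < /tmp/capture.fifo &
capture_program --output /tmp/capture.fifo   # hypothetical capture invocation
wait                                          # tee exits once the fifo hits EOF
rm /tmp/capture.fifo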
