Aria2 max-connection not working - linux

I installed aria2 (1.18.1) in my ubuntu.
but the problem is it can not increase download connection.
aria2c -x 10 http://mirror.sg.leaseweb.net/speedtest/100mb.bin
[#fe34b4 2.1MiB/95MiB(2%) CN:4 DL:233KiB ETA:6m49s]
by default it only download from 4 connection not 10 (as i given).
i even try with another sites, but default 4 connection carried out while downloading
Solved:-
There are multiple options that influence the behavior:
--split -s Maximum number of concurrent splits (connections) per download. Defaults to 5, so unless you changed it, you will get max 5 connections for a single download no matter what -x.
--min-split-size -k A split should only be initiated when the split would be bigger than this. Defaults to 20M, meaning that when you download a 100M file, the time the download is split some data is already retrieved which means less slightly less than 100MB is remaining, meaning 4 splits (5 splits would create splits slightly less than 20MB).
There you have it. Please check out the manual for more information on the various options.
$ aria2c -k 1M -s 10 -x 10 http://mirror.sg.leaseweb.net/speedtest/100mb.bin
[#1325f1 7.4MiB/95MiB(7%) CN:10 DL:1.2MiB ETA:1m8s]

Related

gsutil copy with multithread doesn't finish copying all files

We have around 650 GB of data on google compute engine.
We need to move them to Cloud Storage to a Coldline bucket, so the best options we could find is to copy them with gsutil with parallel mode.
The files are from kilobytes to 10Mb max, and there are few million files.
The command we used is
gsutil -m cp -r userFiles/ gs://removed-websites/
On first run it copied around 200Gb and stopped with error
| [972.2k/972.2k files][207.9 GiB/207.9 GiB] 100% Done 29.4 MiB/s ETA 00:00:00
Operation completed over 972.2k objects/207.9 GiB.
CommandException: 1 file/object could not be transferred.
On second run it finished almost at the same place, and stopped again.
How can we copy these files successfully ?
Also the buckets that have the partial data are not being removed after deleting them. Console just says preparing to delete, and nothing happens, we waited more than 4 hours, any way to remove those buckets ?
Answering your first question, I can propose the several options. All of them based on data split and uploading by small portions of data.
You can try distributed upload from several machines.
https://cloud.google.com/storage/docs/gsutil/commands/cp#copying-tofrom-subdirectories-distributing-transfers-across-machines
In this case you are splitting data by safe chunks, like 50GB, and uploading it from several machines in parallel. But it requires machines, that is not required actually.
You still can try such splited upload on a single machine, but you need then some splitting mechanism, which will not upload all files at once, but by chunks. In this case, if some thing fails, you will need to reload only this chunk. In addition, you will have better accuracy and you'll be able to localize possible fail place if something happens.
Regarding, how you can delete them. Well, same technique as for upload. Divide data on chunks and delete them by chunks. Or, you can try to remove whole project, if it suitable for your situation.
Update 1
So, I checked gsutil interface and it is supports glob syntax. You can match with glob syntax, for example 200 folders, and launch this command 150 time (this will upload 200 x 500 = 30 000 folders).
You can use such approach and combine it with -m option, so this is partially that your script did, but might work faster. This will work for folders names and files as well.
If you provide examples of the folders names and files names it would be easier to propose appropriate glob pattern.
It could be that you are affected by gs-util issue 464. This happens when you are running multiple gs-util instances concurrently wit the -m option. Apparently these instances share a state directory which causes weird behavior.
One of the workarounds is to add parameters: -o GSUtil:parallel_process_count=1 -o GSUtil:parallel_thread_count=24.
E.g.:
gsutil -o GSUtil:parallel_process_count=1 -o GSUtil:parallel_thread_count=24 -m cp -r gs://my-bucket .
I've just run into the same issue, and turns out that it's caused by the cp command running into an uncopyable file (in my case, a broken symlink) and aborting.
Problem is, if you're running a massively parallel copy with -m, the broken file may not be immediately obvious. To figure out which one it is, try a dry run rsync -n instead:
gsutil -m rsync -n -r userFiles/ gs://removed-websites/
This will clearly flag the broken file and abort, and you can fix or delete it and try again. Alternatively, if you're not interested in symlinks, just use the -e option and they'll be ignored entirely.

rabbitmq inet_gethost reading ~50 times per second /etc/hosts file

My task is to reduce load on Linux machine running rabbitmq. First place in top is taken by inet_gethost 4 ( there are two such processes, but one is constantly sitting on the top of the top ). I started analyzing that process with strace and it's showing huge number of opening and reading from /etc/hosts. strace -fp 4571 -e open 2> count && wc -l count revealed over 50 reads of that file per second. Question is if that kind of behavior is normal or is this a result of badly configured rabbit or some networking settings.

ArangoDB Too many open files

since a few days we encounter a problem with our ArangoDB installation. A few minutes/up to an hour after start up all connections to the database are refused. The arango log file says that there are "Too many open files". A "lsof | grep arango | wc -l" shows that the database has around 50,000 open file handles, which is a lot under the max. allowed by the linux system (around 3m).
Has anyone an idea where this error comes from?
We are using a Ubuntu Linux with a 3.13 kernel. 30 GB RAM and three cores. The database is still very small with around 1,5m entries and a size of 50GB.
Thx, secana
EDIT:
"netstat -anpt | fgrep 2480" shows:
root#syssec-graphdb-001-test:~# netstat -anpt | fgrep 2480
tcp 0 0 10.215.17.193:2480 0.0.0.0:* LISTEN 7741/arangod
tcp 0 0 10.215.17.193:2480 10.215.50.30:53453 ESTABLISHED 7741/arangod
tcp 0 0 10.215.17.193:2480 10.215.50.31:49299 ESTABLISHED 7741/arangod
tcp 0 0 10.215.17.193:2480 10.215.50.30:53155 ESTABLISHED 7741/arangod
"ulimit -n" has a result of 1024, so I think that the ~50,000 are all arango processes together.
Last lines in log file before the database died:
2015-05-26T12:20:43Z [9672] ERROR cannot open datafile '/data/arangodb/databases/database-235999516/collection-28464454696/datafile-18806474509149.db': 'Too many open files'
2015-05-26T12:20:43Z [9672] ERROR cannot open datafile '/data/arangodb/databases/database-235999516/collection-28464454696/datafile-18806474509149.db': Too many open files
2015-05-26T12:20:43Z [9672] DEBUG [arangod/VocBase/collection.cpp:1632] cannot open '/data/arangodb/databases/database-235999516/collection-28464454696', check failed
2015-05-26T12:20:43Z [9672] ERROR cannot open document collection from path '/data/arangodb/databases/database-235999516/collection-28464454696'
It looks like it will make sense to increase the max. number of open files a process is allowed to manage. Given the stated database size of around 50 GB, the (presumably default) value of 1024 seems to be too low.
arangod will require one file descriptor for each parallel client connection. That may not be many, but in the face of HTTP keep-alive connections this could already account for several file descriptors.
Additionally, each datafile of an active collection will need to be memory-mapped and cost one file descriptor as well. With the default datafile size of 32 MB, a database size of 50 GB (on disk) will already consume 1,600 file descriptors:
50 GB database size / (32 MB default size / 1 datafile) = 1600 datafiles
Increasing the ulimit -n value for the arangod user and environment therefore will make sense. You can confirm that arangod can actually use the configured number of file descriptors by starting it with option --server.descriptors-minimum <value>, e.g.
--server.descriptors-minimum 32768
for that many file descriptors. If arangod cannot effectively use that specified amount of file descriptors, it will fail at start with a fatal error. Of course that option can also be put into the arangod.conf file.
Additionally, the default size for (new) datafiles can be increased via the journalSize parameter for collections. That won't help right now, but will lower the number of required file descriptors for data saved in the future.
For emergencies when you can't restart the database, like in my case, you will find very useful this blog post that explains how you can change the ulimit of a running process.
If your distribution has util-linux-2.21, you can use the "prlimit" tool, or you can compile the small example C program in the blog post that worked great for me.
To check the actual limits of a process you can use:
cat /proc/<PID>/limits
Good luck!

Website Benchmarking using ab

I am trying my hand at various benchmarking tools for the website I am working on and have found Apache Bench (ab) to be an excellent tool for load testing. It is a command line tool and is very easy to use, apparently. However I have a doubt about two of its basic flags. The site I was reading says:
Suppose we want to see how fast Yahoo can handle 100 requests, with a maximum of 10 requests running concurrently:
ab -n 100 -c 10 http://www.yahoo.com/
and the explanation for the flags states:
Usage: ab [options] [http[s]://]hostname[:port]/path
Options are:
-n requests Number of requests to perform
-c concurrency Number of multiple requests to make
I guess I am just not able to wrap my head around number of requests to perform and number of multiple requests to make. What happens when I give them both together like in the example above?
Can anyone give me a simpler explanation of what these two flags do together?
In your example ab will create 10 connections to yahoo.com and request a page using each of them simultaneously.
If you omit -c 10 ab will create only one connection and create next only when the first completes(when we have the whole main page downloaded).
If we pretend that server's response time does not depend on the number of requests it is simultaneously handling, your example will complete 10 times faster than without -c 10.
Also: What is concurrent request (-c) in Apache Benchmark?
-n 100 -c 10 means "issue 100 requests, 10 at a time."

Linux: uploading unfinished files - with file size check (scp/rsync) [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 11 years ago.
Improve this question
I typically end up in the following situation: I have, say, a 650 MB MPEG-2 .avi video file from a camera. Then, I use ffmpeg2theora to convert it into Theora .ogv video file, say some 150 MB in size. Finally, I want to upload this .ogv file to an ssh server.
Let's say, the ffmpeg2theora encoding process takes some 15 minutes on my PC. On the other hand, the upload goes on with a speed of about 60 KB/s, which takes some 45 minutes (for the 150MB .ogv). So: if I first encode, and wait for the encoding process to finish - and then upload, it would take approximately
15 min + 45 min = 1 hr
to complete the operation.
So, I thought it would be better if I could somehow start the upload, in parallel with the encoding operation; then, in principle - as the uploading process is slower (in terms of transferred bytes/sec) than the encoding one (in terms of generated bytes/sec) - the uploading process would always "trail behind" the encoding one, and so the whole operation (enc+upl) would complete in just 45 minutes (that is, just the time of the upload process +/- some minutes depending on actual upload speed situation on wire).
My first idea was to pipe the output of ffmpeg2theora to tee (so as to keep a local copy of the .ogv) and then, pipe the output further to ssh - as in:
./ffmpeg2theora-0.27.linux32.bin -v 8 -a 3 -o /dev/stdout MVI.AVI | tee MVI.ogv | ssh user#ssh.server.com "cat > ~/myvids/MVI.ogv"
While this command does, indeed, function - one can easily see in the running log in the terminal from ffmpeg2theora, that in this case, ffmpeg2theora calculates a predicted time of completion to be 1 hour; that is, there seems to be no benefit in terms of smaller completion time for both enc+upl. (While it is possible that this is due to network congestion, and me getting less of a network speed at the time - it seems to me, that ffmpeg2theora has to wait for an acknowledgment for each little chunk of data it sends through the pipe, and that ACK finally has to come from ssh... Otherwise, ffmpeg2theora would not have been able to provide a completion time estimate. Then again, maybe the estimate is wrong, while the operation would indeed complete in 45 mins - dunno, never had patience to wait and time the process; I just get pissed at 1hr as estimate, and hit Ctrl-C ;) ...)
My second attempt was to run the encoding process in one terminal window, i.e.:
./ffmpeg2theora-0.27.linux32.bin -v 8 -a 3 MVI.AVI # MVI.ogv is auto name for output
..., and the uploading process, using scp, in another terminal window (thereby 'forcing' 'parallelization'):
scp MVI.ogv user#ssh.server.com:~/myvids/
The problem here is: let's say, at the time when scp starts, ffmpeg2theora has already encoded 5 MB of the output .ogv file. At this time, scp sees this 5 MB as the entire file size, and starts uploading - and it exits when it encounters the 5 MB mark; while in the meantime, ffmpeg2theora may have produced additional 15 MB, making the .ogv file 20 MB in total size at the time scp has exited (finishing the transfer of the first 5 MB).
Then I learned (joen.dk ยป Tip: scp Resume) that rsync supports 'resume' of partially completed uploads, as in:
rsync --partial --progress myFile remoteMachine:dirToPutIn/
..., so I tried using rsync instead of scp - but it seems to behave exactly the same as scp in terms of file size, that is: it will only transfer up to the file size read at the beginning of the process, and then it will exit.
So, my question to the community is: Is there a way to parallelize the encoding and uploading process, so as to gain the decrease in total processing time?
I'm guessing there could be several ways, as in:
A command line option (that I haven't seen) that forces scp/rsync to continuously check the file size - if the file is open for writing by another process (then I could simply run the upload in another terminal window)
A bash script; say running rsync --partial in a while loop, that runs as long as the .ogv file is open for writing by another process (I don't actually like this solution, since I can hear the harddisk scanning for the resume point, every time I run rsync --partial - which, I guess, cannot be good; if I know that the same file is being written to at the same time)
A different tool (other than scp/rsync) that does support upload of a "currently generated"/"unfinished" file (the assumption being it can handle only growing files; it would exit if it encounters that the local file is suddenly less in size than the bytes already transferred)
... but it could also be, that I'm overlooking something - and 1hr is as good as it gets (in other words, it is maybe logically impossible to achieve 45 min total time - even if trying to parallelize) :)
Well, I look forward to comments that would, hopefully, clarify this for me ;)
Thanks in advance,
Cheers!
May be you can try sshfs (http://fuse.sourceforge.net/sshfs.html).This being a file system should have some optimization though I am not very sure.

Resources