Are downloads from spark distribution archive often slow? - apache-spark

I was trying to download the Spark-Hadoop distribution from https://archive.apache.org/dist/spark/spark-3.1.2/. Downloads from this site are often slow. Is this due to some general issue with the site itself?
I have verified that the download is slow in two ways:
In Colab I ran the command !wget -q https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz, which often keeps running for more than 10 minutes, while at other times it finishes within 1 minute.
I also tried downloading the file from the website directly, and even then the download speed is occasionally extremely slow.

It may be because:
You download the file multiple times.
You download with a non-browser client, for example curl or wget.
Your location is physically far from the file server, or your network is unstable.
Something else, for example the file server itself is slow.
I think most public servers have some kind of safeguard against DDoS attacks, and that safeguard limits download traffic per second.
I faced a similar issue: downloading from the browser took 1 minute, but it took 10 minutes when I used curl.

Related

For IIS, what is the best way to update static files while maintaining availability on a live website?

I've investigated this and found lots of related information, but nothing that answers my question.
A little background. A small set of files is served statically by IIS 10. These files usually need to be updated weekly, and never more than once an hour (unless someone manually runs an update utility for testing). The files are expected to be a couple of kilobytes in size, no larger than 10 kilobytes. The update process can be run on the IIS server and will be written in PowerShell or C#.
My plan for updating files that are actively being served as static files by IIS is:
Copy the files to a temporary local location (on the same volume)
Attempt to move the files to the IIS static site location
The move may fail if the file is in use (by IIS). Implement a simple retry strategy for this.
It doesn't cause a problem if there is a delay in publishing these files. What I really want to avoid is IIS trying to access one of the files at just the wrong time, a race condition while my file replacement is in process. I have no control over the HTTP client which might be a program that's not tolerant of the type of error IIS could be expected to return, like an HTTP status 404, "Not Found".
I have a couple random ideas:
HTTP GET the file from IIS before I replace it with the intention of getting the file into IIS's cache in hopes that will improve the situation.
Just ignore this potential issue and hope for the best.
I can't be the only developer who's faced this. What's a good way to address this issue? (Or is it somehow not an issue at all and I'm just overthinking it?)
Thanks in advance for any help.
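The plan described above (stage the file on the same volume, then move it into place with a simple retry) could be sketched roughly as follows. This is only an illustration in Python rather than PowerShell or C#; the function name publish_file and its retry parameters are hypothetical, and how IIS reacts to the swap is an assumption that would need testing. Because the staged file is on the same volume, os.replace performs a rename rather than a copy, so a reader should see either the old file or the new one, not a partially written one.

```python
import os
import time


def publish_file(staged, target, retries=5, delay=0.5):
    """Move an already-staged file over the live file, retrying if it is locked.

    Assumes `staged` and `target` are on the same volume and retries >= 1.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, retries + 1):
        try:
            # Rename over the existing target; no partially written file is exposed.
            os.replace(staged, target)
            return attempt
        except PermissionError:
            # On Windows this typically means another process has the file open.
            if attempt == retries:
                raise
            time.sleep(delay)
```

A caller would first write the new content to a staging path outside the served directory, then call publish_file(staging_path, live_path).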

Measure speed between two exact sites

I would really like to measure connection speed between two exact sites. Naturally one of the sites is our site. Somehow I need to prove that it is not our internet connection that is flaky, but that the site at the other end is overloaded.
At our end I have windows and linux machines available for this.
I imagine I would run a script at certain times of day which - for example - tries to download an image from that site and try to measure download time. Then put the download time into a database then create a graph from the records in the database. (I know that this is really simple and not sophisticated enough, but hence my question)
I need help on the time measurement.
The perceived speed differences are large: sometimes the application works flawlessly, but sometimes we get timeout errors.
Right now I use speedtest to check that our internet connection is OK, but this does not show that the site that is not working is slow, so I can't provide hard numbers to support my case.
Maybe it is worth mentioning that the application we try to use at the other end is Java-based.
Here's how I would do it in Linux:
Use wget to download whatever URL you think represents your sites best. Parse the output into a file (sed, awk), use crontab to trigger the download multiple times.
wget www.google.com
...
2014-02-24 22:03:09 (1.26 MB/s) - 'index.html' saved [11251]
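If parsing wget output gets awkward, the same measurement could be sketched in Python, which also covers the "put the download time into a database" part of the question. This is a minimal sketch, not a polished tool: the function names, the samples table layout, and the choice of SQLite are all assumptions made for illustration.

```python
import sqlite3
import time
import urllib.request


def measure_download(url, timeout=30.0):
    """Download `url` fully and return (bytes_received, seconds_elapsed)."""
    start = time.perf_counter()
    total = 0
    with urllib.request.urlopen(url, timeout=timeout) as response:
        while True:
            chunk = response.read(64 * 1024)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def record_sample(db_path, url, size, seconds):
    """Append one measurement to a SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success
            conn.execute(
                "CREATE TABLE IF NOT EXISTS samples "
                "(taken_at TEXT, url TEXT, bytes INTEGER, seconds REAL)"
            )
            conn.execute(
                "INSERT INTO samples VALUES (datetime('now'), ?, ?, ?)",
                (url, size, seconds),
            )
    finally:
        conn.close()
```

Run from cron (or Task Scheduler on Windows), each invocation would call measure_download on the chosen URL and pass the result to record_sample; the rows can then be graphed over time. A timeout raises an exception here, which a real script would probably want to catch and record as a failed sample.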

using torrents to back up vhd's

Hi, this is a question that may be redundant, but I have a hunch there is a tool for this, or there should be, and if there isn't I might just make it. Or maybe I am barking up the wrong tree, in which case please correct my thinking.
My problem is this: I am looking for some way to migrate large virtual disk drives off a server once a week over an internet connection of only moderate speed. The solution must allow bandwidth throttling, because the internet connection is always in use.
I thought about it and the problem is familiar: moving large files in a way that can be throttled and can easily survive disconnection and reconnection. The only solution I am familiar with that does this perfectly is torrents.
Is there a way to automatically create torrents and automatically "send" them to a client's download list remotely? I am working on a Windows Hyper-V host but I use only Linux for the guests, and I could easily cook up a guest to do the copying, so consider it a Windows or Linux problem.
PS: the VHDs are "offline" copies of guest servers by the time I am moving them; consider them merely 20-30 GB dumb files.
PPS: I'd rather avoid spending money
Bittorrent is an excellent choice, as it handles both incremental updates and automatic resume after connection loss very well.
To create a .torrent file automatically, use the btmakemetainfo script found in the original bittorrent package, or one from the numerous rewrites (bittornado, ...) -- all that matters is that it's scriptable. You should take care to set the "disable DHT" flag in the .torrent file.
You will need to find a tracker that allows you to track files with arbitrary hashes (because you do not know these in advance); you can either use an existing open tracker, or set up your own, but you should take care to limit the client IP ranges appropriately.
This reduces the problem to transferring the .torrent files -- I usually use rsync via ssh from a cronjob for that.
For point to point transfers, torrent is an expensive use of bandwidth. For 1:n transfers it is great as the distribution of load allows the client's upload bandwidth to be shared by other clients, so the bandwidth cost is amortised and everyone gains...
It sounds like you have only one client in which case I would look at a different solution...
wget allows for throttling and can resume transfers where it left off if the FTP/http server supports resuming transfers... That is what I would use
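To make "throttle and resume" concrete, here is a rough Python sketch of the idea on local files. It is not how wget is implemented; the function name and parameters are made up for illustration, and it assumes that whatever already exists at the destination is a correct prefix of the source (it does not verify this).

```python
import os
import time


def throttled_resume_copy(src, dst, max_bytes_per_sec, chunk_size=64 * 1024):
    """Append the missing tail of `src` to `dst`, capped at an average rate.

    Assumes any existing `dst` content is a valid prefix of `src`.
    Returns the number of bytes copied in this call.
    """
    offset = os.path.getsize(dst) if os.path.exists(dst) else 0
    copied = 0
    start = time.monotonic()
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(offset)  # resume where the previous attempt stopped
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
            # Sleep until the average rate since start is back under the cap.
            expected = copied / max_bytes_per_sec
            elapsed = time.monotonic() - start
            if expected > elapsed:
                time.sleep(expected - elapsed)
    return copied
```

If the transfer is interrupted, calling the function again with the same arguments continues from the current size of the destination file.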
You can use rsync for that (http://linux.die.net/man/1/rsync). Search for the --partial option in man and that should do the trick. When a transfer is interrupted the unfinished result (file or directory) is kept. I am not 100% sure if it works with telnet/ssh transport when you send from local to a remote location (never checked that) but it should work with rsync daemon on the remote side.
You can also use that for sync in two local storage locations.
rsync --partial [-r for directories] source destination
edit: Just confirmed the crossed out statement with ssh

Commits are extremely slow/hang after a few uploads

I've recently started to notice really annoying problems with VisualSVN(+server) and/or TortoiseSVN. The problem is occurring on multiple (2) machines. Both running Windows 7 x64
The VisualSVN-server is running Windows XP SP3.
What happens is that after, say, 1, 2, or 3 files (or a few more, but almost always at the same file) the commit just hangs on transferring data, with a speed of 0 bytes/sec.
I can't find any error logs on the Server. I also just asked for a 45day trial of Enterprise Server for its logging capabilities but no errors there as well.
Accessing the repository disk itself is fast, I can search/copy/paste to that disk/SVN repo disk just fine.
The Visual SVN Server also does not use excessive amounts of memory nor CPU usage, which stays around 0-3%.
Both the Server as well as TortoiseSVN's memory footprint moves/changes which would indicate at least "something" is happening.
Committing with Eclipse (different project (PHP), different repository on the server) is going great. No slow downs, almost instant commits, with 1 file or 50files. The Eclipse plugin that I use is Subclipse.
I am currently quite stuck on this problem and it is prohibiting us from working with SVN right now.
[edit 2011-09-08 1557]
I've noticed that it goes extremely slowly on 'large' files, for instance a 1700MB .resx (binary) or a 77KB .h source (text) file. 'Small' files (< 10KB) go almost instantly.
[edit 2011-09-08 1608]
I've just added the code to code.google.com to see if the problem is on my end or the server end. Adding to google code goes just fine, no hangs at all. 2,17MB transferred in 2mins and 37secs.
I've found and fixed the problem. It turned out to be a faulty NIC: speedtest.net reported ~1mbit, and swapping in a different NIC pushed this to the maximum of 60mbit and solved my commit problems.

why script downloads prevent parallelism in browsers

According to this link:
"If you serve your images from multiple hostnames, you can get more than two downloads to occur in parallel. (I've gotten Internet Explorer to download over 100 images in parallel.) While a script is downloading, however, the browser won’t start any other downloads, even on different hostnames."
What I wonder is why downloading a script prevents parallelism. I guess the browser executes the script right after the download finishes. So why not download the other resources in parallel and defer only the execution? Is it an implementation choice? Is there any other reason for that?
The browser executes the script as it is downloading it, so it will not get any further on the page to see what else there is that it needs to download. You can use something like LABjs to help with this though: http://labjs.com/
