I want to be able to distribute bundles of files, about 500 MB per bundle, to all machines on a corporate "extranet" (which is basically a few LANs connected using various private mechanisms, including leased lines and VPN).
The total number of hosts is roughly 100, and the goal is to get a copy of the bundle from one host onto all the other hosts reliably, quickly, and efficiently. One important constraint is that some hosts are grouped together on single fast LANs; in that case the network I/O should happen once from one group to the next, and then within each group between all the peers. This is as opposed to a strict central-server system, where multiple hosts in the same group might each fetch the same bundle over a slow link, rather than fetching it once via the slow link and then copying it between each other quickly.
A new bundle will be produced every few days, and occasionally old bundles will be deleted (but that problem can be solved separately).
The machines in question happen to run recent Linuxes, but bonus points will go to solutions which are at least somewhat cross-platform (in which case the bundle might differ per platform but maybe the same mechanism can be used).
That's pretty much it. I'm not opposed to writing some code to handle this, but it would be preferable if it were one of bash, Python, Ruby, Lua, C, or C++.
I think all these problems have been solved by modern research into p2p networking and packaged nicely. A bit of scripting plus BitTorrent should solve them: torrent clients exist for all modern OSs, so a script on each machine can check a location for a new torrent file, start the download, and delete the old bundle once the download has finished.
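A minimal sketch of such a watcher script, assuming a Transmission daemon running on each host and a shared drop directory for .torrent files (the paths, state file, and use of transmission-remote are my assumptions, not part of the answer):

    #!/usr/bin/env python3
    """Watch a shared location for new .torrent files and hand them to the
    local BitTorrent client. Paths and transmission-remote are illustrative
    assumptions."""
    import pathlib
    import subprocess
    import time

    TORRENT_DROP = pathlib.Path("/srv/bundles/torrents")  # hypothetical shared drop dir
    SEEN = pathlib.Path("/var/lib/bundle-sync/seen.txt")  # hypothetical state file

    def seen_torrents():
        return set(SEEN.read_text().split()) if SEEN.exists() else set()

    def mark_seen(name):
        SEEN.parent.mkdir(parents=True, exist_ok=True)
        with SEEN.open("a") as fh:
            fh.write(name + "\n")

    while True:
        known = seen_torrents()
        for torrent in sorted(TORRENT_DROP.glob("*.torrent")):
            if torrent.name not in known:
                # hand the new torrent to the local client (Transmission assumed here)
                subprocess.run(["transmission-remote", "--add", str(torrent)], check=True)
                mark_seen(torrent.name)
        time.sleep(60)  # poll every minute

Deleting the old bundle once the new one has finished could be handled in the same loop.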
What about rsync?
I'm going to second compie's suggestion of rsync to copy the files, in which case you can use a scripting language of your choice.
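For example, a minimal sketch of driving rsync from Python (the peer list and paths are placeholders of mine, not from the question):

    #!/usr/bin/env python3
    """Push a bundle to a list of peers with rsync over SSH; hosts and paths
    are placeholder assumptions."""
    import subprocess

    BUNDLE = "/srv/bundles/current/"              # trailing slash: copy directory contents
    PEERS = ["host-a.example", "host-b.example"]  # hypothetical peers

    for peer in PEERS:
        # -a preserves permissions/times, -z compresses over slow links,
        # --delete removes files that are no longer part of the bundle
        subprocess.run(
            ["rsync", "-az", "--delete", BUNDLE, f"{peer}:/srv/bundles/current/"],
            check=True,
        )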
On the originating system you will need a script containing some representation of the hosts and a matrix of link speeds between them. From that information you calculate a minimum spanning tree. You can then send messages to the systems you intend to propagate to, detailing the MST and the bundle to fetch, at which point a script/daemon on each of those hosts begins the transfer and in turn contacts the hosts reachable over the fastest links...
You could implement it in bash; Python might be better, or a custom C daemon.
When you update the network you'll need to update the matrix based on latest information.
See: Prim's Algorithm.
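As a rough sketch of that idea, here is Prim's algorithm over a weighted host matrix in Python; the hosts and weights below are made-up placeholders:

    """Prim's algorithm over a host/host cost matrix; hosts and weights are
    illustrative assumptions."""
    import heapq

    hosts = ["hq", "lan1-a", "lan1-b", "lan2-a"]
    # weight[i][j] = cost of sending the bundle from hosts[i] to hosts[j]
    # (None = no usable link); values here are made up
    weight = [
        [None, 10,   10,   100],
        [10,   None, 1,    120],
        [10,   1,    None, 110],
        [100,  120,  110,  None],
    ]

    def prim_mst(weight, root=0):
        """Return MST edges as (parent, child) index pairs."""
        n = len(weight)
        in_tree = {root}
        edges = []
        pq = [(w, root, j) for j, w in enumerate(weight[root]) if w is not None]
        heapq.heapify(pq)
        while pq and len(in_tree) < n:
            w, i, j = heapq.heappop(pq)
            if j in in_tree:
                continue
            in_tree.add(j)
            edges.append((i, j))
            for k, wk in enumerate(weight[j]):
                if wk is not None and k not in in_tree:
                    heapq.heappush(pq, (wk, j, k))
        return edges

    for parent, child in prim_mst(weight):
        print(f"{hosts[parent]} -> {hosts[child]}")

Each edge of the resulting tree is one transfer to schedule, e.g. with the rsync or BitTorrent approaches above.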
I'm currently building a CodingGame-like platform for a school project and I wondered how I could handle user script validation.
The goal is to have multiple exercises to solve with an expected response.
For example, there's an error to correct in the exercise and when corrected it should output "hello".
I don't want users to actually do "console.log('hello')" in the IDE on the website if possible :)
The most difficult part is that I don't know how to execute the actual script (sent as text to the API).
I just want to start with NodeJS available for the user, no PHP or C...
Saving the script as a local file doesn't seem like a good option; that's how we do it for now, but I don't know how coding-test platforms actually do it.
The API is hosted on AWS EC2 Amazon Linux instances.
Thank you for your help :)
This is rather an elaborate business requirement and not a simple question, but here are some points that could help you progress further in your project -
A program is only a program when it is understood by some other program as instructions. Otherwise there's no difference between program.js and program.txt.
Validating and running a program are two different things
Running a program on your own system vs. allowing anyone to run the same program on your system are entirely different situations in every respect. (This is where the web & its virtues come into play.)
There will always be some risk when letting others provide instructions to your systems, as no validation is 100% effective; even if you check for all destructive commands, there will always be more of them. You'll have to consider this while setting up a virtual machine for the sole purpose of public use (keeping in mind that this machine can be lost at any moment, so the architecture should be able to handle such losses, either by spinning up a new machine or having a set of replaceable machines for this).
Isolation & sandboxing of execution is the key in these kinds of requirements. So have a look at tools like Docker, which could be a bit complex to understand initially, but if it suits your needs then simplicity is just one complexity away.
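A minimal sketch of that sandboxing idea, assuming Docker and the official node image are available on the EC2 instance (the limits, timeout, and image tag are my assumptions, not a vetted sandbox):

    """Run an untrusted Node.js script inside a throwaway Docker container.
    Image name, limits and timeout are illustrative assumptions."""
    import pathlib
    import subprocess
    import tempfile

    def run_user_script(source: str, timeout: int = 5) -> str:
        with tempfile.TemporaryDirectory() as workdir:
            script = pathlib.Path(workdir) / "script.js"
            script.write_text(source)
            result = subprocess.run(
                [
                    "docker", "run", "--rm",
                    "--network", "none",          # no network access
                    "--memory", "128m",           # cap memory
                    "--cpus", "0.5",              # cap CPU
                    "-v", f"{workdir}:/code:ro",  # mount the script read-only
                    "node:18-alpine",
                    "node", "/code/script.js",
                ],
                capture_output=True, text=True, timeout=timeout,
            )
            return result.stdout

    # the captured stdout can then be compared with the expected answer, e.g.:
    # assert run_user_script("console.log('hello')").strip() == "hello"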
I'm writing a web application which will likely be updated frequently, including changes to css and js files which are typically cached aggressively by the browser.
In order for changes to become instantly visible to users, without affecting cache performance, I've come up with the following system:
Every resource file has a version. This version is appended with a ? sign, e.g. main.css becomes main.css?v=147. If I change a file, I increment the version in all references. (In practice I would probably just have a script to increment the version for all resources, every time I deploy an update.)
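For reference, a minimal sketch of such a bump script, assuming the references live in HTML/template files passed on the command line (the regex and file handling are my assumptions):

    #!/usr/bin/env python3
    """Rewrite every `?v=N` query string in the given files to a new version;
    the regex and file list are illustrative assumptions."""
    import re
    import sys

    new_version = sys.argv[1]      # e.g. "148", or a build number from CI

    for path in sys.argv[2:]:      # HTML/template files to rewrite
        with open(path) as fh:
            text = fh.read()
        # main.css?v=147 -> main.css?v=148 (same for .js references)
        text = re.sub(r"\.(css|js)\?v=\d+",
                      lambda m: f".{m.group(1)}?v={new_version}", text)
        with open(path, "w") as fh:
            fh.write(text)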
My questions are:
Is this a reasonable approach for production code? Am I missing something? (Or is there a better way?)
Does the question mark introduce additional overhead? I could incorporate the version number into the filename, if that is more efficient.
The approach sounds reasonable. Here are some points to consider:
If you have many different resource files with different version numbers, it might be quite some overhead for developers to correctly manage all of them and increment them in the right situations.
You might need to implement a policy for your team
or write a CI task to check that the devs did it right
You could use one version number for all files. For example, if you have a version number for the whole app, you could use that.
It makes "managing" the versions for developers a no-op.
It changes the links on every deploy
Depending on the number of resource files you have to manage, the frequency of deploys vs. the frequency of deploys that actually change a resource file, and the number of requests for these resource files, one or the other solution might be more performant. It's a question of trade-offs.
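A minimal sketch of the single app-wide version idea, as a template helper (the helper name and version source are assumptions of mine):

    """A tiny template helper that appends one app-wide version to every
    resource URL; APP_VERSION and the helper name are assumptions."""
    APP_VERSION = "1.4.2"  # e.g. read from a VERSION file or a git tag at deploy time

    def asset_url(path: str) -> str:
        """Return the resource URL with the app version as a cache-buster."""
        return f"{path}?v={APP_VERSION}"

    # in a template: <link rel="stylesheet" href="{{ asset_url('main.css') }}">
    print(asset_url("main.css"))  # -> main.css?v=1.4.2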
I've been researching the idea of using distributed file system along with my dedicated servers instead of going with Amazon S3 and the results are nothing but massive headaches!
My project has the following characteristics/requirements:
User files are stored on dedicated servers. Each file is stored on 2 separate machines, located in different data centers (150-200 miles away from each other)
I'm using Amazon RDS to host the associated MySQL database (*). It's fairly compact (only holds IDs/file metadata)
Files/data are around 50TB. Naturally, the data does change and will definitely grow with time
My question is: is there a good general-purpose, distributed, parallel, fault-tolerant file system that has the following characteristics:
Stable & reasonably fast (upload/download)
Fairly easy to set up & maintain
Handles data storage so that I only have to care about removing/adding servers if the need arises (i.e. add new servers to the filesystem's server pool by editing a simple config, or something like that)
I've read about OpenStack, GlusterFS, MogileFS, XtreemFS, etc...but the more I read, the more I get confused!
(*) Yes, I realize the contradiction. Cost-wise it does make sense to host the database on RDS. But storing (up to) 50TB of users' files on Amazon is way too expensive compared to using dedicated servers (provided they're good enough).
PS. My app isn't live yet, so I'm open to suggestions if someone has a good idea that fits my case well.
EDIT: I'm not trying to make an S3 clone; I just need to use an existing hosting infrastructure to build a small-scale cloud solution. My question is about finding the right distributed file system to handle/automate this.
We recently switched from an expensive storage solution to the open-source LizardFS for our distributed storage. It is quite simple to set up and scale once you understand the basic concept.
Check out https://docs.lizardfs.com/introduction.html#architecture for a quick overview. But forget about shadow masters and metaloggers for now. What you need to know is that there are:
a master: which regulates the traffic (make sure it has enough CPU)
chunkservers: which actually store the data. Use any kind of off-the-shelf hardware with a bunch of hard disks attached.
clients: which are just simple mount points. So you can get a giant 50TB mount if you want. The master tells the client where to find/store the files. The actual data is transferred straight between client and chunkserver, and back.
You can add as many chunkservers as you want; the master will automatically try to balance your storage usage across them. Adding storage is a matter of adding hard drives or adding servers. They don't have to be actual bare-metal machines, but that is probably the cheapest.
There are 2 amazing features in LizardFS that allow geo-replication.
Goals (see https://docs.lizardfs.com/adminguide/replication.html#standard-goals): how important files are to you. You can define, at a file or folder level, how many times a file needs to be replicated. Do you want 2 copies? 3? 10? You could define a goal of 2 copies for old files that are simply there for archiving purposes, and a goal of 4 copies on SSD drives for all new files.
Those same goals can also be used to do geo-replication. You define that your data has to be stored in at least two different locations by labeling your chunkservers accordingly (e.g. DC1 and DC2).
Rack awareness (see https://docs.lizardfs.com/adminguide/advanced_configuration.html#configuring-rack-awareness-network-topology): you basically define IP ranges to teach the system what your network looks like. This way, clients will try to serve files from the closest server.
The ease of setting it up is what sold LizardFS to me. I've heard very good things about Ceph, but setting it up is another matter...
What worried me at first was how proven the technology is/was. So I spent quite a lot of time researching who actually uses it.
Orange Poland (a large telecom provider) is one of the users.
And Cloudweavers/OpenNebula actually built a business around it, selling complete solutions.
Won't it take more than one person a few months a year to manage these servers? That will cost some money; then you have the cost of hosting the data yourself, and then the added huge cost that the business/system you are building is not obviously scalable. In addition, any likely investor will be turned away by a complex home-grown data hosting system. How will you ensure integrity/security on par with Amazon? Your maximum savings per year look like $30,000 or so.
You could save money by doing a de-duplicated storage system where you just store all the unique chunks of data - also see rsync. Don't know how redundant your data is though.
I recommend LizardFS and GfarmFS.
IMHO Ceph is a major disappointment and so is XtreemFS.
If I want to develop a registry-like system for Linux, which Windows Registry design failures should I avoid?
Which features would be absolutely necessary?
What are the main concerns (security, ease-of-configuration, ...)?
I think the Windows Registry was not a bad idea, just the implementation didn't fulfill the promises. A common place for configuration, including for example the Apache config, database config, or mail server config, wouldn't be a bad idea and might improve maintainability, especially if it had options for (protected) remote access.
I once worked on a kernel-based solution but stopped because others said that registries are useless (because the Windows Registry is)... what do you think?
A kernel-based registry? Why? Why? A thousand times, why? You might as well ask for a kernel-based musical postcard or a kernel-based inetd, for all the point there is in putting it in there. If it doesn't need to be in the kernel, it shouldn't be in the kernel. There are many other ways to implement a privileged process that don't require deep hackery like that...
If I want to develop a registry-like system for Linux, which Windows Registry design failures should I avoid?
Make sure that applications can change many entries at once in an atomic fashion.
Make sure that there are simple command-line tools to manipulate it.
Make sure that no critical part of the system needs it, so that it's always possible to boot to a point where you can fix things.
Make sure that backup programs back it up correctly!
Don't let chunks of executable data be stored in your registry.
If you must have a single repository, at least use a proper database, so you have tools to back up, restore, and recover it, and you can interact with it without needing a new set of custom APIs.
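A minimal sketch of those points combined - many entries changed atomically, via a plain command-line tool, backed by a proper database (SQLite here; the table layout, path, and tool name are my assumptions):

    #!/usr/bin/env python3
    """regset: apply several key=value changes in one transaction (all or nothing).
    The SQLite file, table layout, and tool name are illustrative assumptions."""
    import sqlite3
    import sys

    DB = "/etc/registry.db"  # hypothetical location

    def set_many(pairs):
        con = sqlite3.connect(DB)
        try:
            with con:  # one transaction: commits on success, rolls back on any error
                con.execute(
                    "CREATE TABLE IF NOT EXISTS settings(key TEXT PRIMARY KEY, value TEXT)"
                )
                # upsert needs SQLite >= 3.24
                con.executemany(
                    "INSERT INTO settings(key, value) VALUES(?, ?) "
                    "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
                    pairs,
                )
        finally:
            con.close()

    if __name__ == "__main__":
        # usage: regset app/mail/host=smtp.example app/mail/port=25
        set_many([arg.split("=", 1) for arg in sys.argv[1:]])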
The first one that comes to my mind is that you somehow need to avoid orphan registry entries. At the moment, when you delete a program you also delete its configuration files, which sit under some directory; once you have a registry system you need to make sure that when a program is deleted, its configuration in the registry is deleted as well.
IMHO, the main problems with the windows registry are:
Binary format. This loses you the availability of a huge variety of very useful tools. With a binary format, tools like diff, search, version control etc. have to be specially implemented, rather than using the best of breed which are capable of operating on the common substrate of text. Text also offers the advantage of trivially embedded documentation / comments (also greppable), and easy programmatic creation and parsing by external tools. It's also more flexible - sometimes configuration is better expressed with a full Turing-complete language than trying to shoehorn it into a structure of keys and subkeys.
Monolithic. It's a big advantage to have everything for application X contained in one place. Move to a new computer and want to keep your settings for it? Just copy the file. While this is theoretically possible with the registry, so long as everything is under a single key, in practice it's a non-starter. Settings tend to be diffused in various places, and it is generally difficult to find where. This is usually given as a strength of the registry, but "everything in one place" generally devolves to "Everything put somewhere in one huge place".
Too broad. It's easy to think of it as just a place for user settings, but in fact the registry becomes a dumping ground for everything. 90% of what's there is not designed for users to read or modify; it is in fact a database of the serialised form of various structures used by programs that want to persist information. This includes things like the entire COM registration system, installed apps, etc. Now, this is stuff that needs to be stored, but the fact that it's mixed in with things like user-configurable settings and stuff you might want to read dramatically lowers its value.
I was thinking about this the other day and wanted to see what the SO community had to say about the subject.
As it stands right now Common Lisp is getting some attention as a web development platform, and with good reason (of which I'm sure you are already convinced).
I was wondering how one would go about using a library in a shared environment in a similar fashion to PHP.
If I set up something like SBCL as an interpreter to interpret FASL files, like Python or PHP, what would be the best way to use libraries (clsql, for instance)?
Most come as ASDF-installable libraries, but it would be a stupid amount of overhead to require and install the library each and every time a request is made.
Keeping in mind this is for shared hosting, would it be best to...
1) Install system wide copies of the libraries for use in applications; reduces space, but there may be problems with using the correct version of the library.
2) Allow users (through a control panel) to install local copies for themselves; more space, no version problems.
3) Tell them to wrap it into a module and load it on demand like Python does (I'm not sure if/how this can be done with Lisp). Just being able to load a library for use would be the best option, but I don't think a lot of them are designed to be used this way.
Anyways, looking to hear your opinions, thanks.
There are two ways I would look at it:
start a Lisp for each request
This way it would be much better if the Lisp were a saved image with all necessary libraries and data loaded. But that approach does not look very promising to me.
run a Lisp and let a frontend (web browser, another web server, ...) connect to it
This way you can either start a saved image or a Lisp that loads a bunch of stuff once and serves the requests.
I like to use saved images/applications in a deployment scenario. They can be quickly started, contain all the necessary software and are independent of library changes.
So it might be useful to provide pre-configured Lisp images that contain the necessary software or let the user configure and save an image.