How would you go about describing the architecture of a "system" that splits a sensitive file into smaller pieces on different servers in order to protect the file?
Would we translate the file into bytes and then distribute those bytes onto different servers? And how would you even go about getting all the pieces back together in order to reconstruct the original file (if you have the correct permissions)?
This is a theoretical problem that I do not know how to approach. Any hints at where I should start?
Not an authoritative answer, but one of what will probably be several partial answers to your question. It may at least give you some ideas.
My guess is that you would be creating a custom file system.
Take a look at various filesystems like
GmailFS: http://richard.jones.name/google-hacks/gmail-filesystem/gmail-filesystem.html
pyfilesystem: http://code.google.com/p/pyfilesystem/
A distributed file system in python: http://pypi.python.org/pypi/koboldfs
Architecturally, then, it will be very similar to the way a typical distributed filesystem is implemented.
It should be a client/server architecture in master/slave mode, and you will have to create a custom protocol for communication between the parts.
The master process is the one you talk to when retrieving or writing your files.
The slave filesystems are distributed across different servers; each keeps tagged files, where a tagged file holds a partial chunk of an original file.
The master filesystem keeps a per-file entry that records the ordered sequence of tagged chunks distributed across the various slave servers.
You can add redundancy by storing each tagged chunk on more than one server.
The communication protocol has to allow multiple servers to respond to a request for a tagged chunk; in the simplest case the master just picks one response and ignores the others.
The usual security requirements need to be respected when storing this information and communicating it between servers.
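To make that concrete, here is a minimal sketch of the split/reassemble logic, with local directories standing in for the slave servers. In a real system each chunk would travel over your custom protocol and the index would be the master's per-file entry; the chunk size and round-robin placement are just assumptions for the sketch:

    import os
    import hashlib

    CHUNK_SIZE = 64 * 1024  # size of each piece; tune for your system

    def split_file(path, servers):
        """Split a file into chunks, distribute them round-robin,
        and return the master's index for the file.

        `servers` is a list of directories standing in for remote
        slave servers; a real master would send each chunk over the
        wire instead of writing it locally.
        """
        index = []  # ordered list of (server, tag) pairs kept by the master
        with open(path, "rb") as f:
            for seq, chunk in enumerate(iter(lambda: f.read(CHUNK_SIZE), b"")):
                tag = hashlib.sha256(chunk).hexdigest()  # content-derived tag
                server = servers[seq % len(servers)]     # round-robin placement
                with open(os.path.join(server, tag), "wb") as out:
                    out.write(chunk)
                index.append((server, tag))
        return index

    def reassemble(index, out_path):
        """Fetch the tagged chunks in index order and concatenate them."""
        with open(out_path, "wb") as out:
            for server, tag in index:
                with open(os.path.join(server, tag), "rb") as f:
                    out.write(f.read())

Note that only someone holding the index (the master's per-file entry) can reassemble the file, which is where the permission check naturally lives.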
You will probably be most interested in Tahoe, a secure distributed filesystem implemented in Python:
http://tahoe-lafs.org/~warner/pycon-tahoe.html
http://tahoe-lafs.org/trac/tahoe-lafs
Related
I'm trying to create a distributed system consisting of a mobile app, a web user panel, and an API that communicates with the DB. I want the user to be able to upload a profile image from both the mobile app and the web user panel, but what is the best and "right" way to store images across a distributed system? I can't really find anything describing best practices on this topic.
I know that the file path should be in the database and the image in a file system. But should that file system be on the API server, or where?
Here is a diagram of what I think the distributed system should look like.
The "right" way to do something complex like image hosting depends on factors like expected traffic and performance expectations. Designing large systems involves a lot of tradeoffs, so it's best to nail down what requirements are for your system are in order to make decisions that serve those requirements.
As for your question, this diagram is roughly correct - you want to store the location of the uploaded image separately from the image itself. If you wanted your solution to be more scalable, one approach would be turning your file system into its own service with its own API. You would store a hash of the file in your database to reference it rather than its path, then obtain the image (or a URL to it) by asking the new storage service's API for the file with that hash.
The reason this is more scalable is that the storage service is free to become its own distributed system when we don't require that every file has an associated file system path within a single namespace. A hash is a good candidate for a replacement of the filesystem path, but you could come up with your own storage ID scheme depending on your needs.
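A toy sketch of that idea, with an in-memory dict standing in for the storage service (a real one would sit behind its own API and could shard blobs across machines, exactly because the hash carries no path):

    import hashlib

    class StorageService:
        """Stand-in for a storage service keyed by content hash."""

        def __init__(self):
            self._blobs = {}

        def put(self, data: bytes) -> str:
            digest = hashlib.sha256(data).hexdigest()
            self._blobs[digest] = data
            return digest  # this is what you store in your database

        def get(self, digest: str) -> bytes:
            return self._blobs[digest]

    # Upload path: store the returned hash in the user's DB row,
    # not a filesystem path.
    storage = StorageService()
    image_id = storage.put(b"...image bytes...")
    # Later, the mobile app and web panel both resolve the same hash.
    image = storage.get(image_id)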
However, this may be wildly out of scope for what you are trying to design. If you only expect to have a few thousand users, storing your images and database on your API server in the file system isn't necessarily wrong, but you might experience growing pains if the requirements of your system grow.
Google's site reliability engineer classroom has a lesson on building a distributed image server, which is an adjacent problem to what you're looking to do: https://sre.google/classroom/imageserver/
TLDR;
I am wondering what the best way is to reliably share an object or other data between n webservers on n different machines?
I have looked at the likes of Redis, but it seems that this is not what I am actually looking for here. I am now thinking that something like IPC over remote / RPC might be more appropriate. Is there a better way to do this, given that it will be called at minimum 10 times over a 30-second interval, a rate which can grow rapidly as the number of users running servers grows?
Example & current use case:
I run a multiplayer mod for a game which receives a decent level of traffic, and we are starting to notice cases where requests sometimes get dropped. The backend webserver is written in Node.js and uses Express in a couple of places too. We are in the process of restructuring the system, and have now come to the part that handles a heartbeat from each server that members of the public host. This information is then shared with players so they can decide which server to join.
Based on my own research, I am looking to host the service on several different machines for redundancy. These machines are linked over a vlan / vswitch so that they have a secure way to communicate with each other. The database is already set up to replicate this way; however, I cannot see a performant way to handle sharing the objects containing information about the servers that have communicated with each webhost.
If it helps the system works something like this:
User's server -> my load balancer -> webhost (backend).
Player -> my load balancer -> webhost (backend), which returns info on all currently online servers.
What is currently in use, as in the example above, is a single-instance webserver that handles all the required requests and processing.
Just an idea while the community proposes answers: consider reading about Apache Thrift. It is not so much IPC-like as RPC-like. If the architecture of your servers, or of the different components of the "backend network", is a "star" with one "master", I would consider that possibility.
If the architecture of your backend is not like that, but rather a group of "independent" entities, what comes to mind is solving this with some "data bus" such as a private MQTT broker, with a group of members subscribing to or publishing data for the rest of the network. In my opinion, the most efficient serialization strategy for the object would be Google Protobuf.
Integrating MQTT with Node.js is very simple, and if the packets are not too heavy and you can accept some latency, I would really recommend running some tests using MQTT with publish/subscribe at QoS 2. It would not take great effort to substitute the underlying communications library you are currently using.
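For illustration, a minimal publish/subscribe sketch using the paho-mqtt Python client (1.x callback API); the same pattern carries over directly to Node's MQTT libraries. The broker address and topic name are made up:

    import paho.mqtt.client as mqtt

    TOPIC = "servers/heartbeat"  # hypothetical topic name

    def on_message(client, userdata, msg):
        # At QoS 2 every subscribed webhost sees each heartbeat exactly once.
        print(msg.topic, msg.payload)

    client = mqtt.Client()  # paho-mqtt 1.x style constructor
    client.on_message = on_message
    client.connect("broker.internal.example", 1883)  # private broker on the vlan
    client.subscribe(TOPIC, qos=2)
    client.loop_start()

    # A game server publishing its heartbeat; the payload could be a
    # Protobuf-encoded status object, as suggested above.
    client.publish(TOPIC, b"server-42:alive", qos=2)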
That said, there seems to be another solution: Kafka, which looks very interesting (I don't really know it).
Your choice will mostly depend on the nature of your data: the weight of the packets, the frequency per user, and the latency you are willing to accept in the worst-case scenario.
I am using Hazelcast Jet 0.6.1 for real-time analysis. There are multiple streams (mostly from a remote journal) coming from different sources.
I would like to know if a full join is supported between multiple streams.
If yes, could you please point me to some links / examples of a full join between multiple streams?
I think you need to elaborate a bit more on what you are trying to do. Streams are theoretically infinite, so the term "full join" has to mean something different than it does in a database.
There are several types of joins available in Jet. As Can said above, there is a merge operator, but you might be thinking more of a windowed join, where you time-bound the period of the join.
Merge Streams is covered here:
https://docs.hazelcast.org/docs/jet/0.7.2/manual/#merge
Window Concepts are here:
https://docs.hazelcast.org/docs/jet/0.7.2/manual/#unbounded-stream-processing
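To illustrate the concept independent of the Jet API, here is a rough sketch of a tumbling-window inner join in plain Python (not Jet code); a "full" join would additionally emit unmatched events once their window closes:

    from collections import defaultdict

    WINDOW_MS = 60_000  # join events whose timestamps fall in the same minute

    def windowed_join(stream_a, stream_b):
        """Join two sequences of (key, timestamp_ms, value) events by key,
        keeping only pairs that land in the same tumbling window."""
        buckets = defaultdict(list)
        for key, ts, value in stream_a:
            buckets[(key, ts // WINDOW_MS)].append(value)
        for key, ts, value in stream_b:
            for left in buckets.get((key, ts // WINDOW_MS), []):
                yield key, left, value  # matched pair within the window

    # Two "infinite" streams, truncated for the sketch.
    a = [("user1", 1_000, "click"), ("user2", 61_000, "click")]
    b = [("user1", 2_000, "purchase"), ("user2", 200_000, "purchase")]
    print(list(windowed_join(a, b)))  # only user1's events share a window

The window is what makes joining infinite streams tractable at all: state only has to be kept until a window closes.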
*This is in response to the comment on the first answer; it's too large for another comment, and I think the first answer is still relevant.
Is this the same data and data type, just from different nodes? Like app servers in a microservices architecture? It seems to me that you have a few options here that really come down to your preferred overall architecture, especially how you want to transport the events. A couple of thoughts:
You can simply merge streams from different data sources if that fits the use case:
See: https://docs.hazelcast.org/docs/jet/0.7.2/manual/#merge
If this is homogeneous data, just distributed across app servers, it might be a case where you use the Hazelcast client on each app server to put events into an IMap (which is shared by all the app servers) with an Event Journal on a Hazelcast cluster. Then Jet simply receives all the events from the Event Journal (see the sketch after these options).
See: https://docs.hazelcast.org/docs/latest/manual/html-single/#event-journal
If you have Kafka available, perhaps you create a topic for the events from the servers and Jet receives the events from Kafka. Either way they are already merged when Jet gets them, so they are processed as one stream.
See: https://docs.hazelcast.org/docs/jet/0.7.2/manual/#kafka
In a microservice architecture, is it advisable to have a centralized collection of proto files and have them as a dependency for clients and servers? Or to have only one proto file per client and server?
If your organization uses a monolithic code base (i.e., all code is stored in one repository), I would strongly recommend using the same file. The only alternative is to copy the file, but then you have to keep all the copies in sync.
If you share the protocol buffer file between the sender and the receiver, you can statically check that both use the same schema, especially if some new microservices will be written in a statically typed language (e.g., Java).
On the other hand, if you do not have a monolithic code base but instead multiple repositories (e.g., one per microservice), then sharing the protocol buffer files is more cumbersome. What you can do is put them in separate repositories that can be added as a dependency to the microservices that need them. That is what I saw at my previous company: we had multiple small API repositories for the schemas.
So, if it is easy to use the same file, I would recommend doing so instead of creating copies. There may be situations, however, where copying is more practical. The disadvantage is that you then have to apply every change to all copies. In the best case, you know which files to update, and it is merely tedious. In the worst case, you do not know which files to update, and your schemas drift out of sync; you only find out when the code is released.
Note that a monolithic code base does not imply a monolithic architecture. You can have microservices and still keep all the source code together in one repository. The famous example is, of course, Google. Google also heavily uses protocol buffers for its internal communication. I have not seen their source code, but I would be surprised if they did not share their protocol buffer files between services.
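For illustration, the shared schema might live in its own API repository as a single .proto file that every service pulls in as a dependency (all names below are made up):

    // user_profile.proto -- lives in a shared API repository that each
    // microservice depends on (e.g., via a submodule or package).
    syntax = "proto3";

    package company.api.user;

    // Gives statically typed consumers (e.g., Java) a stable package name.
    option java_package = "com.company.api.user";

    message UserProfile {
      string user_id = 1;
      string display_name = 2;
    }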
I have a distributed system in mind (multiple nodes in a single datacenter) that I want to have the following properties:
Nodes can enter and leave the system at any time.
There is no data replication between nodes.
Which node the client makes use of is up to the client (i.e., it could be consistent hashing, it could be something else).
There is no master (i.e., no central point of failure).
Each node may receive a piece of information that needs to be forwarded to the rest of the nodes.
What algorithms (links to papers are best) are suitable for this?
(I assume some of the answers will include P2P algorithms, but most of them that I've encountered in the past have acted more like distributed hash tables, where nodes enter and take over some part of the keyspace, etc. I also recognize that multicast with simple UDP messages might be appropriate here, but what existing work would help make the messaging reliable?)
How about trying to implement ADHOC nodes with JXTA? See the Practical JXTA II book, available online at Scribd.
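Beyond JXTA: the forwarding requirement in the question is exactly what gossip (epidemic) protocols address. Each node pushes a new message to a few random peers, and de-duplication by message id keeps the flooding bounded. A toy in-process sketch:

    import random
    import uuid

    FANOUT = 3  # peers each node forwards a new message to

    class Node:
        def __init__(self, name):
            self.name = name
            self.peers = []   # filled in once the cluster is wired up
            self.seen = set() # message ids already handled (dedup)

        def receive(self, msg_id, payload):
            if msg_id in self.seen:
                return        # already forwarded; stop the epidemic here
            self.seen.add(msg_id)
            for peer in random.sample(self.peers, min(FANOUT, len(self.peers))):
                peer.receive(msg_id, payload)  # a real system sends UDP/TCP here

    # Wire up a small cluster; membership can change at any time by
    # editing `peers` -- there is no master.
    nodes = [Node(f"n{i}") for i in range(10)]
    for n in nodes:
        n.peers = [p for p in nodes if p is not n]

    nodes[0].receive(uuid.uuid4().hex, "config-update")
    print(sum(1 for n in nodes if n.seen))  # usually all 10 after the push phase

A pure push round delivers with high probability but not certainty; real implementations add periodic anti-entropy rounds to make delivery reliable. For papers, the Bimodal Multicast work (Birman et al.) and SWIM (Das, Gupta, Motivala) are the usual starting points for the reliability question raised at the end.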