I'm studying Apache Storm. I have read the source code, the developer documentation, the JavaDoc, and other useful blogs about Storm.
One question confuses me a lot. Most documentation and blogs say that different schedulers lead to different assignment styles when the Storm cluster assigns a topology to workers. But I am confused about what role the strategies in the package "org.apache.storm.scheduler.resource.strategies.XXX", such as DefaultResourceAwareStrategy and the other two strategies, play when Storm assigns a topology.
In the Storm example programs, I remember these strategies being passed to the method setTopologyStrategy() to choose a strategy when creating a topology. So, what is the difference between a scheduler like ResourceAwareScheduler and a strategy like DefaultResourceAwareStrategy in task assignment? Which one actually decides the task assignment?
I searched for this on Google but did not find a clear answer. I hope to get a reply that explains the difference between a scheduler and a strategy as clearly as possible. Thanks a lot.
Storm has a few different schedulers as you note. Some of them don't take the resources of the supervisor nodes into account. ResourceAwareScheduler is a scheduler implementation that can take supervisor resources/load into account when deciding where to assign a topology.
In order for ResourceAwareScheduler to be flexible, it uses a strategy to figure out how to rank the different supervisors. ResourceAwareScheduler contains the common code necessary to be resource aware, while the strategies do the actual scheduling. The ResourceAwareScheduler uses the strategy to do the scheduling, if that makes sense.
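To make the split concrete, here is a minimal sketch of where each piece gets configured (the config key comes from the Resource Aware Scheduler documentation, but setter names and signatures vary between Storm versions, so check this against your release): the scheduler is a cluster-wide setting in storm.yaml, while the strategy is chosen per topology in that topology's Config.

```java
// Sketch only: verify the exact constants/signatures against your Storm version.
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class RasConfigExample {
    public static void main(String[] args) throws Exception {
        // Cluster-wide (set by the operator in storm.yaml, shown here as a comment):
        //   storm.scheduler: "org.apache.storm.scheduler.resource.ResourceAwareScheduler"

        // Per-topology: tell the ResourceAwareScheduler which strategy to use
        // when it places THIS topology's executors onto supervisors.
        Config conf = new Config();
        conf.put(Config.TOPOLOGY_SCHEDULER_STRATEGY,
                 "org.apache.storm.scheduler.resource.strategies.scheduling.DefaultResourceAwareStrategy");

        TopologyBuilder builder = new TopologyBuilder();
        // ... add spouts/bolts here ...
        StormSubmitter.submitTopology("example-topology", conf, builder.createTopology());
    }
}
```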
Look at https://github.com/apache/storm/blob/master/storm-server/src/main/java/org/apache/storm/scheduler/resource/strategies/scheduling/DefaultResourceAwareStrategy.java#L108 and the corresponding line in GenericResourceAwareStrategy. The comments there explain what the different strategies do.
Let's take a simple example: a Beam pipeline that just reads from a file and dumps the data into an output file. Now consider that the input file is huge (some GBs in size, the kind of file you typically can't open in a text editor). Since the direct-runner implementation is quite simple (it reads the whole input set into memory), it won't be able to read and output such a huge file (unless you assign an impractically high amount of memory to the Java VM process). So my question is: how do production runners like Flink, Spark, or Cloud Dataflow deal with this "huge dataset problem", assuming they don't just try to fit the whole file(s)/dataset into memory?
I'd expect a production runner's implementation to work "in parts or batches" (reading/processing/outputting in parts) to avoid trying to fit huge datasets into memory at any specific point in time. Can somebody please share their feedback on how production runners deal with this "huge data" situation?
Generalizing, note that this applies to other input/output mechanisms too. For example, if my input is a PCollection coming from a huge database table (huge both in row size and in number of rows), does the internal implementation of the production runner somehow divide the given input SQL statement into many internally generated sub-statements, each taking a smaller subset (for example by internally generating a count(*) statement, followed by N statements, each taking count(*)/N elements)? (The direct runner won't do this and will just pass the given query 1:1 to the DB.) Or is it my responsibility as a developer to "iterate in batches" and divide the problem? And if that is indeed the case, what are the best practices here, i.e. having one pipeline for this or many? And if only one, should I somehow parametrise the pipeline to read/write in batches, or iterate over a simple pipeline and manage the necessary metadata externally to the pipeline?
thanks in advance, any feedback would be greatly appreciated!
EDIT (reflecting David's feedback):
David, your feedback is highly valuable and definitely touches on the point I'm interested in. Having a work-discovery phase for splitting a source and a read phase to concurrently read the split partitions is definitely what I was interested in hearing, so thanks for pointing me in the right direction. I have a couple of small follow-up questions, if you don't mind:
1 - The article points out under the section "Generic enumerator-reader communication mechanism" the following:
"The SplitEnumerator and SourceReader are both user implemented class.
It is not rare that the implementation require some communication
between these two components. In order to facilitate such use cases [....]"
So my question here would be: is that "splitting + reading" behaviour triggered by some user-provided (i.e. developer-provided) implementation (specifically SplitEnumerator and SourceReader), or can I benefit from it out of the box without any custom code?
2 - Probably just delving deeper into the question above: if I have a batch/bounded workload (let's say I'm using Apache Flink) and I'm interested in processing a "huge file" as described in the original post, will the pipeline work "out of the box" (doing the behind-the-scenes "work preparation phase" splits and the parallel reads), or would that require some custom code implemented by the developer?
Thanks in advance for all your valuable feedback!
Note that when the inputs are bounded and known in advance (i.e., a batch workload as opposed to streaming), this is more straightforward.
In Flink, which is designed with streaming in mind, this is done by separating "work discovery" from "reading". A single SplitEnumerator runs once and enumerates the chunks to be read (the splits/partitions), and assigns them to parallel readers. In the batch case a split is defined by a range of offsets, while in the streaming case, the end offset for each split is set to LONG_MAX.
This is described in more detail in FLIP-27: Refactor Source Interface.
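To make the work-discovery idea concrete, here is a tiny framework-free sketch (plain Java; this is not the actual Flink SplitEnumerator/SourceReader API, just the shape of the idea): enumerate offset-range splits once, then hand them out to parallel readers.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: this mimics what a source's "work discovery" step does.
// It is NOT the Flink SplitEnumerator/SourceReader API.
public class SplitSketch {
    record Split(long startOffset, long endOffset) {}

    // Enumerate fixed-size byte ranges over a bounded file of the given length.
    static List<Split> enumerateSplits(long fileLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(start, Math.min(start + splitSize, fileLength)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 10 GiB file cut into 128 MiB splits, assigned round-robin to parallel
        // readers; each reader only ever holds one split's worth of data at a time.
        List<Split> splits = enumerateSplits(10L << 30, 128L << 20);
        int parallelism = 4;
        for (int i = 0; i < splits.size(); i++) {
            System.out.printf("split %d -> reader %d%n", i, i % parallelism);
        }
    }
}
```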
Just to provide some closure to this question: the motivation behind it was to find out whether Apache Beam, when coupled with a production runner (like Flink, Spark, or Google Cloud Dataflow), offers out-of-the-box mechanisms for splitting work, i.e. reading/writing/manipulating huge files (or data sources in general). The comment provided by David Anderson above proved of great value in hinting at how Apache Flink deals with these workflows.
At this point I've implemented solutions using huge files (to test for possible IO bottlenecks) with a "Beam on Flink" based pipeline, and I can confirm that Flink will create an execution plan that includes splitting sources and dividing work in such a way that no memory problem arises. There can of course be conditions under which stability or IO performance is compromised, but at least I can confirm that the workflows carried out behind the pipeline abstraction use the filesystem when carrying out tasks, avoiding fitting all data into memory and thus avoiding trivial memory errors. Conclusion: yes, "Beam on Flink" (and likely Spark and Dataflow too) does offer proper work preparation, work splitting and filesystem usage so that the available volatile memory is used in an efficient way.
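For reference, the kind of pipeline I tested is essentially just the sketch below (the file paths are placeholders). Note that nothing split-related appears in the code itself; the splitting and parallel reading happen inside the runner, behind the TextIO source.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class HugeFileCopy {
    public static void main(String[] args) {
        // Run with --runner=FlinkRunner (plus the beam-runners-flink dependency)
        // to get the split/parallel-read behaviour described above.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadHugeFile", TextIO.read().from("/data/huge-input.txt"))  // placeholder path
         .apply("WriteCopy", TextIO.write().to("/data/output/huge-copy"));   // placeholder prefix

        p.run().waitUntilFinish();
    }
}
```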
Update about data sources: regarding DBs as data sources, Flink won't (and can't, it is not trivial) optimize/split/distribute work related to DB data sources in the same way it optimizes reading from the filesystem. There are still approaches for reading huge amounts of data (records) from a DB, but the implementation details need to be addressed by the developer instead of being the responsibility of the framework. I've found this article (https://nl.devoteam.com/expert-view/querying-jdbc-database-in-parallel-with-google-dataflow-apache-beam/) very helpful in addressing the point of reading massive amounts of records from a DB in Beam (the article uses the Cloud Dataflow runner, but I used Flink and it worked just fine), splitting queries and distributing the processing.
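To sketch the approach from that article in code: you do the "work discovery" yourself by generating key ranges, and the runner executes one bounded sub-query per range in parallel via JdbcIO.readAll(). The table name, key column, range size and connection settings below are made up for illustration, and the JdbcIO method names should be checked against your Beam version.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;

public class ParallelJdbcRead {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // "Work discovery" done by hand: cover the key space of the table with ranges.
        // (10_000_000 stands in for something like SELECT MAX(id) FROM users.)
        List<KV<Long, Long>> ranges = new ArrayList<>();
        long maxId = 10_000_000L;
        long rangeSize = 100_000L;
        for (long lo = 0; lo < maxId; lo += rangeSize) {
            ranges.add(KV.of(lo, lo + rangeSize));
        }

        // Each range becomes one bounded sub-query; the runner executes them in parallel.
        p.apply(Create.of(ranges).withCoder(KvCoder.of(VarLongCoder.of(), VarLongCoder.of())))
         .apply(JdbcIO.<KV<Long, Long>, String>readAll()
             .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                 "org.postgresql.Driver", "jdbc:postgresql://db-host:5432/shop")) // placeholder
             .withQuery("SELECT name FROM users WHERE id >= ? AND id < ?")        // placeholder
             .withParameterSetter((range, stmt) -> {
                 stmt.setLong(1, range.getKey());
                 stmt.setLong(2, range.getValue());
             })
             .withRowMapper(rs -> rs.getString(1))
             .withCoder(StringUtf8Coder.of()));

        p.run().waitUntilFinish();
    }
}
```

The linked article builds the ranges inside the pipeline itself; the hard-coded maxId here just keeps the sketch short.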
Before you read my question: this topic fits more than one StackExchange site (Mathematics, Software Recommendations, Software Engineering, Stack Overflow), so I put it on the most popular one. Please move it if you think it fits better somewhere else.
TL;DR: I need something useful that I can compute in a simple distributed calculation app, and that is not one of the most common things (DNA, fractals, ...).
The end of the semester is coming and I have a semester project to do for the subject "Distributed systems". The task is to build a distributed system (across a few physical devices connected over a LAN). I have some options like a distributed chat, a shared variable, or, what I prefer, a distributed calculation.
My question is what I can compute with this. If I choose this topic, I want it to be useful for something.
I do not have knowledge of biomedicine (to compute DNA), advanced mathematics (e.g. fractals) or similar fields that distributed systems are mostly used for.
Do you guys have some ideas?
PS: It is not important but I will code it most likely in Node.JS or Java
You can go with prime number calculation using brute force; I assume the value of your project lies not in the efficiency of the algorithm but in how you distribute the calculation.
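The work splits naturally into number ranges: a coordinator hands out ranges, and each node runs something like the brute-force check below on its range and reports back the primes it found (the range boundaries are just for illustration).

```java
import java.util.ArrayList;
import java.util.List;

public class PrimeWorker {
    // Brute-force primality test by trial division; deliberately simple,
    // since the point of the project is the distribution, not the algorithm.
    static boolean isPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Each worker node is assigned one [from, to) range by the coordinator.
    static List<Long> primesInRange(long from, long to) {
        List<Long> primes = new ArrayList<>();
        for (long n = from; n < to; n++) {
            if (isPrime(n)) primes.add(n);
        }
        return primes;
    }

    public static void main(String[] args) {
        // Example: this node was given the range [1_000_000, 1_010_000).
        System.out.println(primesInRange(1_000_000L, 1_010_000L));
    }
}
```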
Something that would be really interesting could be to execute queries using distributed calculation. Depending on your familiarity with databases and the time you can devote, you can support as many types of queries as you find challenging and interesting (e.g. a distributed join).
To elaborate, you will have a number of nodes and some data that will be partitioned across those nodes and you will have a client performing queries on all those data. Your system will be able to answer those queries by doing some local computation on each node and then combining the results in a meaningful way to return the final answer.
To sum up, your project would be a simplified distributed query engine.
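A minimal sketch of that scatter/gather idea (the AVG example and all names are made up for illustration): each node computes a partial aggregate over its own partition, and the client combines the partials into the final answer.

```java
import java.util.List;

public class DistributedAvg {
    // What each node returns for "SELECT AVG(price) FROM orders" over ITS partition.
    record Partial(long count, double sum) {}

    // Runs locally on one node against its own data partition.
    static Partial localAggregate(List<Double> partitionPrices) {
        double sum = 0;
        for (double p : partitionPrices) sum += p;
        return new Partial(partitionPrices.size(), sum);
    }

    // Runs on the client: combine the partials into the final AVG.
    static double combine(List<Partial> partials) {
        long count = 0;
        double sum = 0;
        for (Partial p : partials) {
            count += p.count();
            sum += p.sum();
        }
        return count == 0 ? 0 : sum / count;
    }

    public static void main(String[] args) {
        Partial node1 = localAggregate(List.of(10.0, 20.0));
        Partial node2 = localAggregate(List.of(30.0));
        System.out.println(combine(List.of(node1, node2))); // 20.0
    }
}
```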
I'm working on a Mesos framework to run some jobs and it seems like a great opportunity to learn about making a highly available system. To that end, I'm doing some reading on distributed systems and I made the mistake of visiting wikipedia.
The passage in question is talking about a principle of HA engineering:
Reliable crossover. In multithreaded systems, the crossover point itself tends to become a single point of failure. High availability engineering must provide for reliable crossover.
My google-fu teaches me three things:
1) audio crossover devices split a single input into multiple outputs
2) genetic algorithms use crossover to combine solutions
3) buzzwordy white papers all copied from this wikipedia article :/
My question: What does a 'crossover point' mean in this context, and why is it single point of failure?
Reliable crossover in this context means:
The ability to switch from a node X (which is broken somehow) to a node Y without losing data.
Non-reliable HA-database example:
Copy the database every 5 minutes to a passive node. => Here you can lose up to 5 minutes of data.
=> Here the copy action is the single point of failure.
Reliable HA-database example:
Setting up data replication where (for example) your insert statement only returns "executed OK" once the transaction has been copied to the secondary server.
(yes: data replication is more complex than this, this is a simplified example in the context of the question)
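In code, the difference boils down to when the write call returns. A deliberately simplified sketch, matching the simplification above (Replica is a made-up interface, not a real database API):

```java
import java.util.List;

// Simplified illustration of the two examples above, not a real database.
public class ReplicationSketch {
    interface Replica {
        void apply(String record); // ships the record to the other node
    }

    // Non-reliable: return OK immediately, copy to the passive node later.
    // Anything written between two copy runs is lost if the primary dies.
    static void writeAsync(String record, List<String> localLog) {
        localLog.add(record);
        // a background job copies localLog to the passive node every 5 minutes
    }

    // Reliable crossover: only report OK once the secondary has the record too,
    // so switching from node X to node Y loses nothing that was acknowledged.
    static void writeSync(String record, List<String> localLog, Replica secondary) {
        localLog.add(record);
        secondary.apply(record); // blocks until the secondary confirms
        // only now does the caller see "executed OK"
    }
}
```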
I would like to refer to the question that can be found at this address:
Running alloy analyzers in parallel
Is there any ongoing research, or any conclusion reached, on the decomposition of Alloy models in order to allow a more efficient analysis of models?
This interests me greatly.
I am very interested in this topic too. Maybe we can think about it like this: when starting the Alloy engine, we can call a function from Alloy and ask it to solve one constraint. I think we can call this function in cluster mode and ask each node to solve one constraint. Then we can combine the results from each node. See here for an example: http://alloy.mit.edu/alloy/code/ExampleUsingTheCompiler.java.html
I am not sure we can work like this, but it's worth thinking about in a MapReduce-like framework.
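A rough sketch of that idea with a plain thread pool; solveCommand is a hypothetical placeholder for whatever the linked Alloy compiler example does for a single command, not a real Alloy API call:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelAlloyRuns {
    // Hypothetical wrapper: runs one Alloy command (as in ExampleUsingTheCompiler)
    // and returns whether a satisfying instance was found.
    static boolean solveCommand(String modelPath, int commandIndex) {
        // ... call the Alloy API here for the single command ...
        return true; // placeholder result
    }

    public static void main(String[] args) throws Exception {
        int commandCount = 4; // e.g. one run/check command per decomposed constraint
        ExecutorService pool = Executors.newFixedThreadPool(commandCount);

        List<Future<Boolean>> results = new ArrayList<>();
        for (int i = 0; i < commandCount; i++) {
            final int cmd = i;
            results.add(pool.submit((Callable<Boolean>) () -> solveCommand("model.als", cmd)));
        }

        // Combine the per-command results (here: the model holds iff every command does).
        boolean all = true;
        for (Future<Boolean> r : results) all &= r.get();
        System.out.println("all commands satisfiable: " + all);
        pool.shutdown();
    }
}
```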
I am in the early stages of design of an application that has to be highly available and scalable. I want to use an eventual consistency data model for this for a number of reasons. I know and understand why this is an unpopular architectural choice for many solutions, but it's important in my case.
I am looking for real-world advice, best-practices and gotchas to look out for when dealing with distributed / document-style databases. And particularly areas around e-commerce (shopping cart style) apps that traditionally are easier to put together with a relational db.
I understand using these types of DB is challenging, but hey, Google and E-bay use them so they can't be that hard ;-) Any advice would be appreciated.
If you want to have a distributed system (that "eventual consistency" thing), you need people to build, maintain and operate it.
I have found that there are three classes of people who have very few problems with "eventual consistency":
People with a solid background in distributed systems. They have learned about eventual consistency, Byzantine failures and stuff like that. If you understand that Paxos is not about holidays, you are probably one of them.
People experienced in network programming. They might miss the theoretical background but have an intuitive understanding of asynchronicity and the "no global clocks & counters" paradigm. If you own at least 8 books by Richard Stevens, you are probably one of them.
Very experienced coders who have had little exposure to RDBMSs. Kernel guys, people from scientific computing and the gaming industry come to mind.
All in all, these people are very sought after on the job market. For example, 75% or so of the academics in distributed systems leave for institutions that run big, self-designed distributed systems, e.g. the stock exchanges.
The whole thing got somewhat simpler with offerings like Hadoop, SimpleDB and CouchDB, but it is still a big challenge to build something on distributed systems technology.
On the other hand, RDBMSs are a very fine piece of engineering. They are well understood, and expertise on them is available on the job market. There are a lot of decent tools and education opportunities, and lots of highly skilled experts are available to be rented by the hour. So think twice before concluding that you can't get on with an RDBMS approach, perhaps coupled with some clever cheating. I usually point students to the LiveJournal architecture.
For Distributed Databases there is much less experience. That's exactly the reason you have found so little advice so far.
If you are determined to use "eventual consistency", I think that besides immature tools the main challenge is the mindset of everyone involved. Are your API users (coders) and application users (your employees and your customers) willing and able to accept the inconsistency? Can you hide it from certain classes of users? We are not used to the mindset that computers can be inconsistent. Something is in stock or it isn't. "Maybe" isn't an answer users expect.
Also keep in mind that "eventual" can mean a very long time to algorithm designers. For how long can you accept inconsistency?
For a shopping cart application you might want to go truly distributed: use the client's browser as the data store. On checkout you can submit the cart to the server-side batch processing system. This means that for the catalog you only need read-only high availability (easier), and the cart submission is a very narrow interface with no need for transactions. Later on, the processing of the order has no (soft) real-time requirements and is thus easier.
BTW: Last time I checked on the eBay architecture they were big on RDBMSs, but it may have changed since then. (Edit: it did change - see comments)
The only solution to your problem is to decide which tradeoffs in the CAP theorem are right for you, then begin implementing it.
mdorseif has a great point. There are many configurations of the extent to which you trade off consistency, availability, and partitioning. You have two main options.
Go the route of an in-house distributed system (takes lots of expertise and research)
Vet and experiment with a number of distributed databases to decide which can handle your requirements at scale.
This is probably an over-simplification (a real production-ready pipeline is an ecosystem), but it'll at least get you on the right track.
AppNexus is an ad platform that uses HBase for very high availability and eventual consistency. They talk a lot about this here.
An article on http://highscaleability.com outlines how the New York Times implemented RabbitMQ alongside Cassandra across a WAN for fault tolerance and high availability.
MongoDB provides a great deal of flexibility in balancing consistency with availability with their implementation of write concerns. They've got excellent documentation that highlights exactly how to implement it with all the gotchas (including partitioning). They implement the two-phase commit to maintain state across the network (on their config servers).
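For example, with the MongoDB Java driver the consistency/latency trade-off is a one-line choice per collection (the connection string, database and collection names below are placeholders):

```java
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class WriteConcernExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) { // placeholder
            MongoCollection<Document> carts = client
                .getDatabase("shop")       // placeholder database name
                .getCollection("carts")    // placeholder collection name
                // MAJORITY: the insert is acknowledged only once a majority of the
                // replica set has it, leaning towards consistency over latency.
                // WriteConcern.W1 or UNACKNOWLEDGED lean towards availability/speed.
                .withWriteConcern(WriteConcern.MAJORITY);

            carts.insertOne(new Document("userId", "u42").append("items", 3));
        }
    }
}
```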
Google has a great paper on this subject: their Photon project implements a highly scalable, highly reliable system with the Paxos algorithm at its heart, alongside a few other techniques. It also happens to be very consistent (with end-to-end latency of about 10s) and fault tolerant, standing up to regional failures.
All systems built on distributed computing models are built on CAP and BASE. The main concern here is: if our system provides availability and partition tolerance, we cannot have true consistency, but we can have eventual consistency.
The idea behind eventual consistency is that each node is always available to serve requests. As a trade-off, data modifications are propagated in the background to other nodes. This means that at any time the system may be inconsistent, but the data is still largely accurate.
Source: http://www.techspritz.com/eventual-consistency-and-base-model/
How to achieve high availability and scalability using relational databases is well known and there is a vast body of knowledge out there on how to do this!
Google is a special case which does not apply to most sites: very, very high volumes of queries, very, very large amounts of data, and, most importantly, no service level agreements with most of its users. There is no correct answer to a Web search, only better answers; for the average user Google is good enough, and if Google misses a vital page from a search result you as a user cannot complain.
eBay is a rather different case: somehow they have persuaded their users and customers to accept poor service in exchange for theoretically lower prices. Good on them, but this is not an option for every business.