What to compute on school project "Distributed calculation"? (Useful) - node.js

Before you read my question: this topic fits more than one StackExchange site (Mathematics, Software Recommendations, Software Engineering, Stack Overflow), so I put it on the most popular one. Please move it if you think it fits better somewhere else.
TL;DR: I need something useful that I can compute in a simple distributed calculation app, and that is not one of the most common examples (DNA, fractals, ...).
The end of the semester is coming and I have a semester project to do for the subject "Distributed systems". The task is to build a distributed system (across a few physical devices connected by LAN). I have some options like a distributed chat, a shared variable, or, what I prefer, a distributed calculation.
My question is what I can compute with it. If I choose this topic, I want it to be useful for something.
I do not have knowledge of biomedicine (to compute DNA), advanced mathematics (e.g. fractals), or the similar fields that distributed systems are mostly used for.
Do you guys have some ideas?
PS: It is not important, but I will most likely code it in Node.js or Java.

You can go with prime number calculation using brute force; I assume the value of your project is not in the efficiency of the algorithm, but in how you distribute the calculation.
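For what it's worth, the interesting part is really the scatter/gather structure, not the primality test. A minimal sketch in Java (all names are mine; the workers are simulated with local threads here, whereas in your assignment each chunk would be sent to another machine over the LAN):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Coordinator splits [2, limit) into chunks; each "worker" finds the primes in its chunk.
public class DistributedPrimes {

    // deliberately brute force - the point is the distribution, not the algorithm
    static boolean isPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // the work a single worker node would perform on its chunk
    static List<Long> primesInChunk(long from, long to) {
        List<Long> primes = new ArrayList<>();
        for (long n = from; n < to; n++) {
            if (isPrime(n)) primes.add(n);
        }
        return primes;
    }

    public static void main(String[] args) throws Exception {
        long limit = 1_000_000;
        int workers = 4;
        long chunk = (limit - 2) / workers;

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<List<Long>>> partials = new ArrayList<>();
        for (int w = 0; w < workers; w++) {              // scatter: one chunk per worker
            long from = 2 + w * chunk;
            long to = (w == workers - 1) ? limit : from + chunk;
            partials.add(pool.submit(() -> primesInChunk(from, to)));
        }

        long count = 0;
        for (Future<List<Long>> f : partials) {          // gather: merge the partial results
            count += f.get().size();
        }
        pool.shutdown();
        System.out.println("Primes below " + limit + ": " + count);
    }
}

In the distributed version the coordinator sends (from, to) pairs over the LAN instead of submitting them to a thread pool, and the workers send their lists (or just counts) back.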

Something that would be really interesting could be to execute queries using distributed calculation. Depending on your familiarity with databases and on the time you can devote, you can support as many types of queries as you find challenging and interesting (e.g. a distributed join).
To elaborate: you will have a number of nodes, some data that is partitioned across those nodes, and a client performing queries over all of that data. Your system will answer those queries by doing some local computation on each node and then combining the results in a meaningful way to return the final answer.
To sum up, your project would be a simplified distributed query engine.
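As a concrete illustration of the "local computation + merge" idea, here is a tiny local sketch (all class and method names are made up; in the real system each Node would be a separate process answering over the network, but the shape of the computation stays the same):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Conceptual sketch of the scatter/gather pattern behind a distributed query.
public class MiniQueryEngine {
    record Order(String customer, double amount) {}

    // One "node" holding a partition of the data.
    static class Node {
        final List<Order> partition;
        Node(List<Order> partition) { this.partition = partition; }

        // local part of: SELECT customer, SUM(amount) ... GROUP BY customer
        Map<String, Double> localSumByCustomer() {
            return partition.stream().collect(
                Collectors.groupingBy(o -> o.customer(),
                                      Collectors.summingDouble(Order::amount)));
        }
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            new Node(List.of(new Order("alice", 10), new Order("bob", 5))),
            new Node(List.of(new Order("alice", 7),  new Order("carol", 3))));

        // scatter: ask every node; gather: merge the partial aggregates
        Map<String, Double> total = new HashMap<>();
        for (Node n : nodes) {
            n.localSumByCustomer().forEach((c, s) -> total.merge(c, s, Double::sum));
        }
        System.out.println(total);   // alice=17.0, bob=5.0, carol=3.0 (order may vary)
    }
}

A distributed join works the same way, except the merge step is more involved (e.g. re-partitioning both inputs by the join key before the local joins).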


How to deal with (Apache Beam) high IO bottlenecks?

Let's say, to cite a simple example, that I have a very simple Beam pipeline which just reads from a file and dumps the data into an output file. Now let's consider that the input file is huge (some GBs in size; the kind of file you typically can't open in a text editor). Since the direct-runner implementation is quite simple (it reads the whole input set into memory), it won't be able to read and output those huge files (unless you assign an impractically high amount of memory to the Java VM process). So my question is: how do production runners like Flink/Spark/Cloud Dataflow deal with this "huge dataset problem", assuming they would not just try to put the whole file(s)/dataset into memory?
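For reference, this is the kind of pipeline I mean (Java SDK; the paths are placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CopyHugeFile {
    public static void main(String[] args) {
        // the runner (direct, Flink, Spark, Dataflow, ...) is picked via the pipeline options
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadInput",   TextIO.read().from("/data/huge-input.txt"))   // placeholder path
         .apply("WriteOutput", TextIO.write().to("/data/output/part"));      // placeholder prefix

        p.run().waitUntilFinish();
    }
}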
I'd expect production runners' implementations to need to work "in parts or batches" (reading/processing/outputting in parts) to avoid trying to fit huge datasets into memory at any specific point in time. Can somebody please share their feedback regarding how production runners deal with this "huge data" situation?
Generalizing, please notice this applies to other input/output mechanisms too. For example, if my input is a PCollection coming from a huge database table (broadly speaking, huge in both row size and row count), does the internal implementation of the production runner somehow divide the given input SQL statement into many internally generated sub-statements, each taking a smaller subset (for example by internally generating a count(-) statement, followed by N statements, each taking count(-)/N elements)? (The direct-runner won't do this and will just pass the given query 1:1 to the DB.) Or is it my responsibility as a developer to "iterate in batches" and divide the problem? And if this is indeed the case, what are the best practices here, i.e. having one pipeline for this or many? And if only one, should I somehow parametrise the pipeline to read/write in batches, or iterate over a simple pipeline and manage the necessary metadata externally to the pipeline?
Thanks in advance, any feedback would be greatly appreciated!
EDIT (reflecting David's feedback):
David, your feedback is highly valuable and definitely touches the point I'm interested in. Having a work-discovery phase for splitting a source and a read phase to concurrently read the split partitions is definitely what I was interested in hearing, so thanks for pointing me in the right direction. I have a couple of small follow-up questions if you don't mind:
1 - The article points out under the section "Generic enumerator-reader communication mechanism" the following:
"The SplitEnumerator and SourceReader are both user implemented class.
It is not rare that the implementation require some communication
between these two components. In order to facilitate such use cases [....]"
So my question here would be: is that "splitting + reading behaviour" triggered by some user-provided (i.e. developer-provided) implementation (specifically SplitEnumerator and SourceReader), or can I benefit from that out of the box without any custom code?
2 - Probably just delving deeper into the question above: if I have a batch/bounded workload (let's say I'm using Apache Flink) and I'm interested in processing a "huge file" like the one described in the original post, will the pipeline work "out of the box" (doing the behind-the-scenes "work preparation phase" splits and the parallel reads), or would that require some custom code implemented by the developer?
Thanks in advance for all your valuable feedback!
Note that when the inputs are bounded and known in advance (i.e., a batch workload as opposed to streaming), this is more straightforward.
In Flink, which is designed with streaming in mind, this is done by separating "work discovery" from "reading". A single SplitEnumerator runs once and enumerates the chunks to be read (the splits/partitions), and assigns them to parallel readers. In the batch case a split is defined by a range of offsets, while in the streaming case, the end offset for each split is set to LONG_MAX.
This is described in more detail in FLIP-27: Refactor Source Interface.
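To picture the "work discovery vs. reading" split described above, here is a purely conceptual sketch (these are not the real FLIP-27 SplitEnumerator/SourceReader interfaces, just an illustration of splits as offset ranges):

import java.util.ArrayList;
import java.util.List;

// Conceptual only: one enumerator turns a big file into offset-range splits,
// and each parallel reader only ever touches its own ranges.
public class SplitSketch {
    record FileSplit(String path, long start, long end) {}

    // "work discovery": runs once and produces the splits
    static List<FileSplit> enumerate(String path, long fileSize, long splitSize) {
        List<FileSplit> splits = new ArrayList<>();
        for (long start = 0; start < fileSize; start += splitSize) {
            // batch case: a bounded end offset; the streaming case would use Long.MAX_VALUE
            splits.add(new FileSplit(path, start, Math.min(start + splitSize, fileSize)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // e.g. a 10 GiB file cut into 128 MiB splits
        List<FileSplit> splits = enumerate("/data/huge-input.txt", 10L << 30, 128L << 20);
        int parallelism = 8;
        for (int i = 0; i < splits.size(); i++) {
            // "reading": each split is assigned to one of the parallel readers,
            // so no single reader ever has to hold the whole file in memory
            int reader = i % parallelism;
            System.out.printf("reader %d gets %s%n", reader, splits.get(i));
        }
    }
}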
Just to provide some closure to this question: the motivation behind it was to find out whether Apache Beam, when coupled with a production runner (like Flink, Spark, or Google Cloud Dataflow), offers out-of-the-box mechanisms for splitting work, i.e. reading/writing/manipulating huge files (or data sources in general). The comment provided by David Anderson above proved of great value in hinting at how Apache Flink deals with these workflows.
At this point I've implemented solutions using huge files (to test for possible IO bottlenecks) with a "Beam on Flink" based pipeline, and I can confirm that Flink will create an execution plan which includes splitting sources and dividing work in such a way that no memory problem arises. There can of course be conditions under which stability / "IO performance" is compromised, but at least I can confirm that the workflows carried out behind the pipeline abstraction use the filesystem when carrying out tasks, avoiding fitting all data in memory and thus avoiding trivial memory errors. Conclusion: yes, "Beam on Flink" (and likely Spark and Dataflow too) does offer proper work preparation, work splitting, and filesystem usage so that the available volatile memory is used in an efficient way.
Update about data sources: Regarding DBs as data sources, Flink won't (and can't, as it is not trivial) optimize/split/distribute work related to DB data sources in the same way it optimizes reading from the filesystem. There are still approaches to reading huge amounts of data (records) from a DB, but the implementation details need to be addressed by the developer instead of being the responsibility of the framework. I've found this article (https://nl.devoteam.com/expert-view/querying-jdbc-database-in-parallel-with-google-dataflow-apache-beam/) very helpful in addressing the point of reading massive amounts of records from a DB in Beam (the article uses the Cloud Dataflow runner, but I used Flink and it worked just fine), splitting queries and distributing the processing.
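For completeness, this is the shape of the approach from that article as I used it (a sketch only: the table, column, and connection details are placeholders, the id bounds would normally come from a preliminary MIN/MAX query, and the exact JdbcIO API may differ between Beam versions):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;

import java.util.ArrayList;
import java.util.List;

public class ParallelJdbcRead {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Build the id ranges up front (placeholder bounds; in practice they would come
        // from a preliminary SELECT MIN(id), MAX(id) against the table).
        List<KV<Long, Long>> ranges = new ArrayList<>();
        for (long lo = 0; lo < 10_000_000L; lo += 100_000L) {
            ranges.add(KV.of(lo, lo + 100_000L));
        }

        p.apply("IdRanges", Create.of(ranges)
                .withCoder(KvCoder.of(VarLongCoder.of(), VarLongCoder.of())))
         .apply("ReadChunks", JdbcIO.<KV<Long, Long>, String>readAll()
                .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                        "org.postgresql.Driver", "jdbc:postgresql://db-host/mydb") // placeholders
                        .withUsername("user").withPassword("secret"))
                .withQuery("SELECT payload FROM big_table WHERE id >= ? AND id < ?")
                .withParameterSetter((range, stmt) -> {      // one range -> one bounded query
                    stmt.setLong(1, range.getKey());
                    stmt.setLong(2, range.getValue());
                })
                .withRowMapper(rs -> rs.getString("payload"))
                .withCoder(StringUtf8Coder.of()));
        // ...further transforms / writes would go here...

        p.run().waitUntilFinish();
    }
}

The key point is that the splitting decision (the id ranges) is made by the developer, and the runner then distributes the many small bounded queries across its workers.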

How to send multiple timeframe data from MQL4 to a Node.js?

I am trying to get multiple time-frame data for a given trading instrument ( _Symbol ) from a MetaTrader4 Terminal to a Node.js process.
How can I do it?
Can we do it from the same EA inside a MetaTrader4 Terminal?
A.1: Yes, we can.
A.2: No, that initial idea is not a good one.
While the intention is clear, the idea of using a single EA to send live data for multiple trading instruments does not serve that interest well.
The MQL4 code-execution environment has some fixed, hard-wired internal logic, and due to this, plus the reality of how Capital Markets and Broker-type market-access mediators work, the solo EA will never fit these requirements.
A simple call to
iOpen( aTradingInstrumentSymbolNAME,    // similarly iHigh, iLow, iClose, iVolume, iTime
       aSelectedTimeFrameDefinedCODE,
       aRelativeBarPTR
       )
is by far not enough.
A professional solution will require a lot of care for real-time handling capabilities, for unmasking the actual flow of mutually hiding events, and for achieving minimal processing latencies, so quite a high level of engineering expertise will be needed.
Start with learning the basics about Scripts, benchmark all your critical code sections by recording their actual durations in [us], and make sure your code remains non-blocking under all circumstances. This will decide whether more than one code-execution thread will be necessary at prime time / during peak hours.
Having managed that, your path has only just started to lead towards your expected result.
Next, one has to decide on a feasible inter-process / distributed-computing data flow and the signalling needed for inter-platform integration.
Last but not least, an important point is the legal side of such an undertaking. It depends both on your local jurisdiction and on your Broker's Terms & Conditions, as no one would enjoy celebrating a technically well-mastered project from inside a jail.
All that, quite an interesting Project.
iOpen( Symbol(), PERIOD_M1, 1 ) is the way to get data from M1 (the last completed bar); if you need another timeframe, replace PERIOD_M1 with another ENUM_TIMEFRAMES value. So what is the problem? Usually StackOverflow requires an MCVE-based example from you in order to help.

High Availability - What does Crossover mean in this context?

I'm working on a Mesos framework to run some jobs, and it seems like a great opportunity to learn about making a highly available system. To that end, I'm doing some reading on distributed systems, and I made the mistake of visiting Wikipedia.
The passage in question is talking about a principle of HA engineering:
Reliable crossover. In multithreaded systems, the crossover point itself tends to
become a single point of failure. High availability engineering must provide for reliable
crossover.
My google-fu teaches me three things:
1) audio crossover devices split a single input into multiple outputs
2) genetic algorithms use crossover to combine solutions
3) buzzwordy white papers all copied from this wikipedia article :/
My question: what does a 'crossover point' mean in this context, and why is it a single point of failure?
Reliable crossover in this context means:
The ability to switch from a node X (which is broken somehow) to a node Y without losing data.
Non-reliable HA-database example:
Copy the database every 5 minutes to a passive node. => Here you can lose up to 5 minutes of data.
=> Here the copy action is the single point of failure.
Reliable HA-database example:
Setting up data replication where (for example) your insert statement only returns as "executed OK" once the transaction has been copied to the secondary server.
(Yes, data replication is more complex than this; this is a simplified example in the context of the question.)
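To make the difference concrete, here is a tiny conceptual sketch in Java of that rule ("only acknowledge the client once a replica also holds the data"); all names are illustrative and this is nowhere near a real replication protocol:

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SynchronousWrite {
    interface Replica { boolean apply(String key, String value); }

    static class Primary {
        private final List<Replica> replicas;
        private final ConcurrentMap<String, String> store = new ConcurrentHashMap<>();
        Primary(List<Replica> replicas) { this.replicas = replicas; }

        // Returns "executed OK" only once every replica has confirmed the write,
        // so a crossover to a replica cannot lose this data.
        boolean put(String key, String value) {
            store.put(key, value);
            for (Replica r : replicas) {
                if (!r.apply(key, value)) {
                    return false;   // no confirmed copy -> do not acknowledge the client
                }
            }
            return true;
        }
    }

    public static void main(String[] args) {
        ConcurrentMap<String, String> replicaStore = new ConcurrentHashMap<>();
        Replica replica = (k, v) -> { replicaStore.put(k, v); return true; };
        Primary primary = new Primary(List.of(replica));
        System.out.println("ack to client: " + primary.put("order:42", "PAID"));
    }
}

The "copy every 5 minutes" approach acknowledges the client before any copy exists, which is exactly the window in which the crossover becomes unreliable.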

Eventual Consistency

I am in the early stages of design of an application that has to be highly available and scalable. I want to use an eventual consistency data model for this for a number of reasons. I know and understand why this is an unpopular architectural choice for many solutions, but it's important in my case.
I am looking for real-world advice, best-practices and gotchas to look out for when dealing with distributed / document-style databases. And particularly areas around e-commerce (shopping cart style) apps that traditionally are easier to put together with a relational db.
I understand that using these types of DBs is challenging, but hey, Google and eBay use them, so they can't be that hard ;-) Any advice would be appreciated.
If you want to have a Distributed System (that "Eventual Consistency" thing), you need people to build, maintain, and operate it.
I found that there are three classes of people who have very few problems with "Eventual Consistency":
People with a solid background in distributed systems. They have learned about Eventual Consistency, Byzantine Failures, and stuff like that. If you understand that Paxos is not about holidays, you are probably one of them.
People experienced in network programming. They might miss the theoretical background but have an intuitive understanding of asynchrony and the "no global clocks & counters" paradigm. If you own at least 8 books by Richard Stevens, you are probably one of them.
Very experienced coders who have had little exposure to RDBMSs. Kernel guys, people from scientific computing, and the gaming industry come to mind.
All in all, these people are much sought after in the job market. For example, 75% or so of the academics in distributed systems leave for institutions that run big, self-designed distributed systems, e.g. the stock exchanges.
The whole thing got somewhat simpler with offerings like Hadoop, SimpleDB, and CouchDB, but it is still a big challenge to build something on distributed systems technology.
On the other hand, RDBMSs are a very fine piece of engineering. They are well understood, and expertise on them is available on the job market. There are a lot of decent tools, education opportunities, and lots of highly skilled experts available to be rented by the hour. So think twice about whether you can't get by with an RDBMS approach, perhaps coupled with some clever cheating. I usually point students to the LiveJournal architecture.
For Distributed Databases there is much less experience. That's exactly the reason you have found so little advice so far.
If you are determined to use "Eventual Consistency", I think that besides immature tools the main challenge is the mindset of everyone involved. Are your API users (coders) and application users (your employees and your customers) willing and able to accept the inconsistency? Can you hide it from certain classes of users? We are not used to the mindset that computers are inconsistent. Something is in stock or it isn't. "Maybe" isn't an answer users expect.
Also keep in mind that "eventual" can mean a very long time to algorithm designers. For how long can you accept inconsistency?
For a shopping cart application you might want to go truly distributed: use the client's browser as the data store. On checkout you can submit the cart to a server-side batch processing system. This means that for the catalog you only need read-only high availability (easier), and the cart submission is a very narrow interface with no need for transactions. Later on, the processing of the order has no (soft) real-time requirements and thus is easier.
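A sketch of how narrow that server-side interface can be (plain JDK HTTP server; the endpoint name and the "queue it for batch processing" part are placeholders for whatever backend you actually use):

import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentLinkedQueue;

// The cart lives in the browser until checkout; the server only ever accepts a
// finished cart and queues it for later (eventual) processing.
public class CheckoutEndpoint {
    private static final ConcurrentLinkedQueue<String> submittedCarts = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/checkout", exchange -> {
            try (InputStream in = exchange.getRequestBody()) {
                String cartJson = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                submittedCarts.add(cartJson);               // hand over to the batch processor
            }
            byte[] body = "accepted".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(202, body.length); // 202: will be processed, eventually
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }
}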
BTW: Last time I checked on the eBay architecture they were big on RDBMSs, but it may have changed since then. (Edit: it did change - see comments)
The only solution to your problem is to decide which tradeoffs in the CAP theorem are right for you, then begin implementing it.
mdorseif has a great point. There are many configurations of the extent to which you trade off consistency, availability, and partition tolerance. You have two main options:
Go the route of an in-house distributed system (takes lots of expertise and research)
Vet and experiment with a number of distributed databases to decide which can handle your requirements at scale.
This is probably an over-simplification. A real production-ready pipeline is an ecosystem. It'll at least get you on the right track.
AppNexus is an ad platform that uses HBase for very high availability and eventual consistency. They talk a lot about this here.
An article on http://highscalability.com outlines how the New York Times implemented RabbitMQ alongside Cassandra across a WAN for fault tolerance and high availability.
MongoDB provides a great deal of flexibility in balancing consistency with availability through its implementation of write concerns. They've got excellent documentation that highlights exactly how to implement it, with all the gotchas (including partitioning). They implement a two-phase commit to maintain state across the network (on their config servers).
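For instance, with the MongoDB Java driver the trade-off is essentially one call per write (a sketch; the connection string, database, and collection names are placeholders):

import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class WriteConcernDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://replica-set-host:27017")) {
            MongoCollection<Document> orders =
                client.getDatabase("shop").getCollection("orders");

            // ack from the primary only: fast, but weaker guarantees if the primary fails
            orders.withWriteConcern(WriteConcern.W1)
                  .insertOne(new Document("cart", "c-1").append("status", "submitted"));

            // wait until a majority of replica-set members have the write:
            // slower, but the write survives a failover
            orders.withWriteConcern(WriteConcern.MAJORITY)
                  .insertOne(new Document("cart", "c-2").append("status", "paid"));
        }
    }
}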
Google has a great paper on this subject: their Photon project implements a highly scalable, highly reliable system with the Paxos algorithm at the heart of it, alongside a few other techniques. It also happens to be very consistent (with end-to-end latency of about 10s) and fault tolerant, standing up to regional failures.
All systems built on distributed computing models are built on CAP and BASE. The main concern here is: if our system provides Availability and Partition Tolerance, we cannot have true consistency, but we can have eventual consistency.
The idea behind eventual consistency is that each node is always available to serve requests. As a trade-off, data modifications are propagated in the background to other nodes. This means that at any time the system may be inconsistent, but the data is still largely accurate.
Source: http://www.techspritz.com/eventual-consistency-and-base-model/
How to achieve high availability and scalability using relational databases is well known and there is a vast body of knowledge out there on how to do this!
Google is a special case which does not apply to most sites: very, very high volumes of queries, very, very large amounts of data, and, most importantly, no Service Level Agreements with most of its users. There is no correct answer to a web search, only better answers; for the average user Google is good enough, and if Google misses a vital page from a search list, you as a user cannot complain.
eBay is a rather different case: somehow they have persuaded their users and customers to accept poor service in exchange for theoretically lower prices -- good on them, but this is not an option for every business.

Production, Test, Developer Environments vs Security

What are current practices for enabling developers to build systems that contain private data? Can anyone point to a "best practices" guide for that sort of thing?
We have a Catch-22 here in that developers need to write applications that go against systems holding data that is considered "private." The IT administration would like us developers to not have access to the data (i.e. they would provide a schema or data structure, but not the data itself), whereas most developers (myself included) would like to have access to the production data, since not having a representative dataset can lead to bad assumptions (e.g. about the format of the data) and bugs later on.
Does anyone have any formalized "best practices" for this type of thing? Official guidelines from some "BigCo" (e.g. Microsoft, IBM) would especially help, since they are needed to convince management.
My view of the world may be different, as I'm based in the UK, but for the past 20-odd years, I've worked primarily in the public sector on systems handling sensitive data.
The rules are **completely** cut-and-dried. No production data is allowed on the development estate.
As a fundamental principle, we do not want to be responsible for the loss of sensitive data. The users are perfectly good at that, themselves.
Within the past 12 months, my wife has moved from the same regime to one in the private sector where they allow developers access to production data and she's horrified by it. The legal implications (in the UK, at least) can be severe.
Developers don't **need** access to production data. It's simply laziness. Define and create test data to exercise defined test cases (including edge cases) and don't rely on the random-esque nature of production data.
If you **must** use production data (i.e. you manage to convince someone who doesn't know any better that it's acceptable), ensure the data is anonymised **before** it reaches the development estate.
Often times, a subset of sanitized data will be provided that is representative of the private data, but not the private data itself.
At my company, we started using Red Gate's data generator to generate test data. There is a bit of setup, but you can use the tool to generate very usable test data. Yes, I would prefer to use live production data, but it's not feasible (especially if you need to take HIPAA into consideration). It uses regexes for each column and allows you to use look-up tables for related tables.
At MediumCo, we strip proprietary data out of our production data in Test and Dev. It has hurt us a little in the past to not have exactly-representative data, but the clients have asked about this point before, and it's usually not an issue, as the environments are populated with a lot of fake proprietary data.
I don't have any best practices paper or anything. But I would think that if you're developing out of an environment that is as protected as the environment that hosts the data in production, there wouldn't be a lot of argument to be made against it.
That is, if your production database is in a datacenter hosted and controlled and secured by your IT staff, if you have a development database that lives in the exact same scenario and doesn't offer any new ways to access the information - you would be in pretty good shape. As an added token of good will - it might be nice to offer to allow anyone worried about security a chance to do some kind of penetration test to ensure that you're telling the truth about security.
The other side of this, of course, is the analysis of the cost for not using the data: that is, it will lead to buggier code, which will cost $xxxxxx.xx in development time vs. virtually no cost to allow a small subset of your development team access to said data.
To avoid the need to manually sanitise/anonymise data, you could use random text replacement: replace every alphanumeric character in each text field with a random alphanumeric character. This:
keeps the data similar in length, size etc. from the developer's point of view
does not cause problems with character sets
leaves date and number fields untouched, which allows for accurate testing with respect to date ranges and quantities
will satisfy most privacy requirements
If you wanted to go a little further you could run random number-for-number replacement on telephone numbers and zip codes, while using alphanumeric replacement on other text fields.
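A minimal sketch of such a replacement in Java (names are mine; letters map to random letters, digits to random digits, and everything else is left alone so lengths and formats are preserved):

import java.security.SecureRandom;

public class Anonymizer {
    private static final SecureRandom RNG = new SecureRandom();

    // Replaces every letter with a random letter and every digit with a random digit,
    // preserving length, case, and punctuation so the data "shape" stays realistic.
    public static String scramble(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            if (Character.isUpperCase(c)) {
                out.append((char) ('A' + RNG.nextInt(26)));
            } else if (Character.isLowerCase(c)) {
                out.append((char) ('a' + RNG.nextInt(26)));
            } else if (Character.isDigit(c)) {
                out.append((char) ('0' + RNG.nextInt(10)));
            } else {
                out.append(c);   // keep separators, spaces, and punctuation untouched
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(scramble("John Smith, +44 7911 123456, SW1A 1AA"));
    }
}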
Having an automated replacement script allows you to get up-to-date data dumps from the live system regularly, so your tests are up-to-date with respect to the size and variability of the data in practice.
It does mean that a small number of operations will not be realistic (e.g. indexing on name fields, which in real life are clustered around common letters) but these should be limited.
