How to deal with (Apache Beam) high IO bottlenecks? - apache-spark

To cite a simple example, let's say I have a very simple Beam pipeline which just reads from a file and dumps the data into an output file. Now consider that the input file is huge (some GBs in size, the kind of file you can't typically open in a text editor). Since the direct runner's implementation is quite simple (it reads the whole input set into memory), it won't be able to read and output such huge files (unless you assign an impractically high amount of memory to the Java VM process). So my question is: how do production runners like Flink/Spark/Cloud Dataflow deal with this "huge dataset" problem, assuming they don't just try to fit the whole file(s)/dataset into memory?
I'd expect production runners' implementations to work "in parts or batches" (reading/processing/outputting in parts) to avoid trying to fit the whole dataset into memory at any specific point in time. Can somebody please share their feedback on how production runners deal with this "huge data" situation?
Generalizing, please note this applies to other input/output mechanisms too. For example, if my input is a PCollection coming from a huge database table (broadly speaking, huge in both row size and number of rows), does the production runner's internal implementation somehow divide the given input SQL statement into many internally generated sub-statements, each taking a smaller subset (for example, by internally generating a count(*) statement, followed by N statements, each taking count(*)/N elements)? (The direct runner won't do this; it just passes the given query 1:1 to the DB.) Or is it my responsibility as a developer to "iterate in batches" and divide the problem myself? And if this is indeed the case, what are the best practices here, i.e. having one pipeline for this or many? And if only one, should I somehow parametrise the pipeline to read/write in batches, or iterate over a simple pipeline and manage the necessary metadata externally to the pipeline?
Thanks in advance; any feedback would be greatly appreciated!
EDIT (reflecting David's feedback):
David, your feedback is highly valuable and definitely touches the point I'm interested in. Having a work-discovery phase for splitting a source and a read phase to concurrently read the split partitions is definitely what I was interested in hearing, so thanks for pointing me in the right direction. I have a couple of small follow-up questions, if you don't mind:
1 - Under the section "Generic enumerator-reader communication mechanism", the article points out the following:
"The SplitEnumerator and SourceReader are both user implemented class.
It is not rare that the implementation require some communication
between these two components. In order to facilitate such use cases [....]"
So my question here would be: is that "splitting + reading" behaviour triggered by some user-provided (i.e. developer-provided) implementation (specifically SplitEnumerator and SourceReader), or can I benefit from it out of the box without any custom code?
2 - Probably just delving deeper into the question above: if I have a batch/bounded workload (let's say I'm using Apache Flink) and I'm interested in processing a "huge file" as described in the original post, will the pipeline work "out of the box" (doing the behind-the-scenes "work preparation phase" splits and the parallel reads), or would that require some custom code implemented by the developer?
Thanks in advance for all your valuable feedback!

Note that when the inputs are bounded and known in advance (i.e., a batch workload as opposed to streaming), this is more straightforward.
In Flink, which is designed with streaming in mind, this is done by separating "work discovery" from "reading". A single SplitEnumerator runs once and enumerates the chunks to be read (the splits/partitions), and assigns them to parallel readers. In the batch case a split is defined by a range of offsets, while in the streaming case, the end offset for each split is set to LONG_MAX.
This is described in more detail in FLIP-27: Refactor Source Interface.
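To make the "work discovery" vs. "reading" separation more concrete, here is a tiny, purely illustrative Java sketch (this is not Flink's actual Source API; FileSplit and enumerateSplits are made-up names): the enumerator produces byte ranges once, and each parallel reader only ever touches its own range.

import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    // A split is just a byte range of the input file (hypothetical type, for illustration only).
    record FileSplit(String path, long start, long length) {}

    // "Work discovery": enumerate the splits once, up front.
    static List<FileSplit> enumerateSplits(String path, long fileSize, int parallelism) {
        long chunk = (fileSize + parallelism - 1) / parallelism;
        List<FileSplit> splits = new ArrayList<>();
        for (long start = 0; start < fileSize; start += chunk) {
            splits.add(new FileSplit(path, start, Math.min(chunk, fileSize - start)));
        }
        return splits;
    }

    // "Reading": each parallel reader only ever holds one buffer of its own split in memory.
    static void readSplit(FileSplit split) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile(split.path(), "r")) {
            file.seek(split.start());
            byte[] buffer = new byte[64 * 1024];
            long remaining = split.length();
            while (remaining > 0) {
                int read = file.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (read < 0) break;
                remaining -= read;
                // process the buffer here (find record boundaries, emit records, ...)
            }
        }
    }
}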

Just to provide some closure to this question: its purpose was to find out whether Apache Beam, when coupled with a production runner (like Flink, Spark or Google Cloud Dataflow), offers out-of-the-box mechanisms for splitting work, i.e. reading/writing/manipulating huge files (or data sources in general). The comment provided by David Anderson above proved of great value in hinting at how Apache Flink deals with these workflows.
At this point I've implemented solutions handling huge files (to test possible IO bottlenecks) with a "Beam on Flink" pipeline, and I can confirm that Flink will create an execution plan which includes splitting sources and dividing the work in such a way that no memory problem arises. There can of course be conditions under which stability/IO performance is compromised, but at least I can confirm that the work carried out behind the pipeline abstraction uses the filesystem when executing tasks, avoiding fitting all the data in memory and thus avoiding trivial out-of-memory errors. Conclusion: yes, "Beam on Flink" (and likely Spark and Dataflow too) offers proper work preparation, work splitting and filesystem usage, so that the available volatile memory is used efficiently.
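For reference, the kind of test pipeline I used boils down to something like this minimal Beam sketch (the paths are placeholders); no custom source code is needed, and the runner chosen via --runner takes care of splitting the read:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CopyHugeFile {
    public static void main(String[] args) {
        // FlinkRunner/SparkRunner/DataflowRunner is selected via the --runner pipeline option.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadHugeFile", TextIO.read().from("/data/input/huge-file*.txt"))   // placeholder path
         .apply("WriteCopy", TextIO.write().to("/data/output/huge-file-copy"));     // placeholder prefix

        p.run().waitUntilFinish();
    }
}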
Update about data sources: Regarding DBs as data sources, Flink won't (and can't - it is not trivial) optimize/split/distribute work related to DB data sources the same way it optimizes reading from the filesystem. There are still approaches to reading huge amounts of data (records) from a DB, but the implementation details need to be addressed by the developer instead of being the framework's responsibility. I've found this article (https://nl.devoteam.com/expert-view/querying-jdbc-database-in-parallel-with-google-dataflow-apache-beam/) very helpful in addressing the point of reading massive amounts of records from a DB in Beam (the article uses the Cloud Dataflow runner, but I used Flink and it worked just fine), splitting queries and distributing the processing.
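To illustrate the approach from that article, here is a rough, hand-rolled sketch (the JDBC URL, the records table with its id and payload columns, and the key bounds are all placeholders; recent Beam releases also ship a partitioned JdbcIO read that automates a similar scheme): the developer derives the key ranges up front, and each range is then read inside a DoFn, so the runner can distribute the per-range queries.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class PartitionedJdbcRead {

    // "Work discovery" done by the developer: split [minId, maxId] into roughly equal key ranges.
    static List<KV<Long, Long>> keyRanges(long minId, long maxId, int partitions) {
        long step = (maxId - minId + partitions) / partitions;
        List<KV<Long, Long>> ranges = new ArrayList<>();
        for (long lo = minId; lo <= maxId; lo += step) {
            ranges.add(KV.of(lo, Math.min(lo + step - 1, maxId)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        p.apply(Create.of(keyRanges(1L, 10_000_000L, 100))           // placeholder bounds
                      .withCoder(KvCoder.of(VarLongCoder.of(), VarLongCoder.of())))
         .apply("ReadRange", ParDo.of(new DoFn<KV<Long, Long>, String>() {
             @ProcessElement
             public void processElement(ProcessContext c) throws Exception {
                 // Each range is a separate element, so the runner distributes these queries.
                 // (A real job would reuse a pooled DataSource set up in @Setup instead.)
                 try (Connection conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass");
                      PreparedStatement st = conn.prepareStatement(
                          "SELECT payload FROM records WHERE id BETWEEN ? AND ?")) {
                     st.setLong(1, c.element().getKey());
                     st.setLong(2, c.element().getValue());
                     try (ResultSet rs = st.executeQuery()) {
                         while (rs.next()) {
                             c.output(rs.getString(1));
                         }
                     }
                 }
             }
         }));

        p.run().waitUntilFinish();
    }
}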

Related

kiba-etl Pattern to split transformations into independent pipelines

Kiba is a very small library, and it is my understanding that most of its value is derived from enforcing a modular architecture of small independent transformations.
However, it seems to me that the model of a series of serial transformations does not fit most of the ETL problems we face. To explain the issue, let me give a contrived example:
A source yields hashes with the following structure
{ spend: 3, cost: 7, people: 8, hours: 2 ... }
Our preferred output is a list of hashes where some of the keys might be the same as those from the source, though the values might differ
{ spend: 8, cost: 10, amount: 2 }
Now, calculating the resulting spend requires a series of transformations: ConvertCurrency, MultiplyByPeople, etc. And so does calculating the cost: ConvertCurrencyDifferently, MultiplyByOriginalSpend, etc. Notice that the cost calculations depend on the original (non-transformed) spend value.
The most natural pattern would be to calculate the spend and cost in two independent pipelines, and merge the final output. A map-reduce pattern if you will. We could even benefit from running the pipelines in parallel.
However in my case it is not really a question of performance (as the transformations are very fast). The issue is that since Kiba applies all transforms as a set of serial steps, the cost calculations will be affected by the spend calculations, and we will end up with the wrong result.
Does kiba have a way of solving this issue? The only thing I can think of is to make sure that the destination names are not the same as the source names, e.g. something like 'originSpend' and 'finalSpend'. It still bothers me however that my spend calculation pipeline will have to make sure to pass on the full set of keys for each step, rather than just passing the key relevant to it, and then merging in the Cost keys in the end. Or perhaps one can define two independent kiba jobs, and have a master job call the two and merge their result in the end? What is the most kiba-idiomatic solution to this?
Splitting an ETL pipeline into multiple parallel paths seems to be a key feature of most ETL tools, so I'm surprised that it doesn't seem to be something kiba supports?
I think I lack extra details to be able to properly answer your main question. I will get in touch via email for this round, and will maybe comment here later for public visibility.
Splitting an ETL pipeline into multiple parallel paths seems to be a key feature of most ETL tools, so I'm surprised that it doesn't seem to be something kiba supports?
The main focus of Kiba ETL today is component reuse, lower maintenance cost, modularity and the ability to achieve strong data & process quality.
Parallelisation is supported to some extent though, via different patterns.
Using Kiba Pro parallel transform to run sister jobs
If your main input is something that you can manage to "partition" with a low volume of items (e.g. database id ranges, or a list of files), you can use Kiba Pro parallel transform like this:
source ... # something that generates a list of work items
parallel_transform(max_threads: 10) do |group_items|
  Kiba.run(...)
end
This works well if there is no output at all, or not much output, coming to the destinations of the sister jobs.
This works with threads but one can also "fork" here for extra performance.
Using process partitioning
In a similar fashion, one can structure their jobs in a way where each process will only process a subset of the input data.
This way one can start, say, 4 processes (via cron jobs, or monitored via a parent tool), and pass SHARD_NUMBER=1,2,3,4, which is then used by the source for input-load partitioning.
But!
I'm pretty sure your problem, as you said, is more about workflow control & declarations & the ability to express what needs to be done, rather than performance.
I'll reach out and we'll discuss that.

What to compute on school project "Distributed calculation"? (Useful)

Before you read my question: this topic fits more than one StackExchange site (Mathematics, Software Recommendations, Software Engineering, Stack Overflow), so I posted it on the most popular one. Please move it if you think it fits better somewhere else.
TL;DR: I need something useful that I can compute in a simple distributed-calculation app, and that is not one of the most common things (DNA, fractals, ...)
The end of the semester is coming and I have a semestral project to do for the subject "Distributed systems". The task is to build a distributed system (across a few physical devices connected over a LAN). I have some options, like a distributed chat, a shared variable, or, what I prefer, distributed calculation.
My question is what I could compute with it. If I choose this topic, I want it to be useful for something.
I do not have knowledge of biomedicine (to compute DNA), advanced mathematics (e.g. fractals) or the similar fields that distributed systems are mostly used for.
Do you guys have some ideas?
PS: It is not important, but I will most likely code it in Node.js or Java.
You can go with prime-number calculation using brute force; I assume the value of your project is not in the efficiency of the algorithm but in how you distribute the calculation.
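As an illustrative sketch (class and method names are made up, and the coordinator/worker transport is left out), each node would receive a candidate range from the coordinator and check it by trial division:

import java.util.ArrayList;
import java.util.List;

public class PrimeRangeWorker {

    // Naive primality test by trial division; the inefficiency is fine here,
    // since the point of the project is the distribution, not the number theory.
    static boolean isPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // The work unit a coordinator would send to a node, e.g. as JSON over HTTP or a socket.
    static List<Long> primesInRange(long from, long toInclusive) {
        List<Long> primes = new ArrayList<>();
        for (long n = from; n <= toInclusive; n++) {
            if (isPrime(n)) primes.add(n);
        }
        return primes;
    }

    public static void main(String[] args) {
        // A node would normally get its range from the coordinator; hard-coded here for the sketch.
        System.out.println(primesInRange(1_000_000L, 1_000_100L));
    }
}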
Something that would be really interesting could be to execute queries using distributed calculation. Depending on your familiarity with databases and on the time you can devote, you can support as many types of queries as you find challenging and interesting (e.g. a distributed join).
To elaborate, you will have a number of nodes and some data that is partitioned across those nodes, and you will have a client performing queries over all that data. Your system will be able to answer those queries by doing some local computation on each node and then combining the results in a meaningful way to return the final answer.
To sum up, your project would be a simplified distributed query engine.
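For example, even a simple distributed count-per-key already shows the scatter/gather structure. A minimal sketch (names are made up and the networking between client and nodes is omitted): each node aggregates its local partition, and the client merges the partial results.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScatterGatherCount {

    // Runs on each node: count occurrences per key over that node's local partition.
    static Map<String, Long> localGroupCount(List<String> localRows) {
        Map<String, Long> counts = new HashMap<>();
        for (String key : localRows) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    // Runs on the client/coordinator: merge the partial results returned by all nodes.
    static Map<String, Long> mergePartials(List<Map<String, Long>> partials) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> merged.merge(key, count, Long::sum));
        }
        return merged;
    }
}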

How to send multiple timeframe data from MQL4 to a Node.js?

I am trying to get multiple-timeframe data for different trading instruments ( _Symbol ) from a MetaTrader4 Terminal to a Node.js application.
How can I do it?
Can we do it from the same EA inside a MetaTrader4 Terminal?
A.1: Yes, we can.
A.2: No, that initial idea is not a good one.
While the intention is clear, the idea of using a single EA to send live data for multiple trading instruments does not serve that interest well.
The MQL4 code-execution environment has some fixed, hard-wired internal logic, and due to this, plus the reality of how capital markets and broker-type market-access mediators work, a solo EA will never fit these requirements.
A simple call to
iOpen( aTradingInstrumentSymbolNAME, // iHigh, iLow, iClose, iVolume, iTime
aSelectedTimeFrameDefinedCODE,
aRelativeBarPTR
)
is by far not enough.
A professional solution will require a lot of care for real-time handling capabilities, for unmasking the actual flow of mutually hiding events, and for achieving minimal processing latencies, so quite a high level of engineering expertise will be needed.
Start by learning the basics about Scripts, benchmark all your critical code sections by recording their actual durations in [us], and ensure your code remains non-blocking under all circumstances. This will decide whether more than one code-execution thread will be necessary at prime time / peak hours.
Having managed that, you have only just started on the way towards your expected result.
Next, one has to decide on a feasible inter-process / distributed-computing data flow and the signalling needed for inter-platform integration.
Last, but not least, an important point is the legal side of such an undertaking. It depends both on your local jurisdiction and on the Broker's Terms & Conditions, as no one would enjoy celebrating a technically well-mastered project from inside a jail.
All in all, quite an interesting project.
iOpen( Symbol(), PERIOD_M1, 1 ) is the way to get data from M1 (the last completed bar); if you need another timeframe, replace PERIOD_M1 with another ENUM_TIMEFRAMES value. So what is the problem? StackOverflow usually requires an MCVE-based example in order to help you.

Multiple task as one - Apache Spark

I have a piece of software which processes one picture and gives me some results for that picture, and a database which contains a lot of pictures.
I would like to build a distributed architecture in order to process these pictures on multiple servers and save time.
I heard about Spark and read up on it, but I'm not sure this solution is a good fit for me. Nevertheless, I don't want to miss something.
Indeed, in all the examples I found for Spark, it's always dealing with tasks/jobs that can be split into smaller tasks/jobs.
For example, a text can be split into multiple smaller texts, and so the word count can easily be processed.
However, when I use my software, I need to give a whole picture and not just parts of it.
So, is it possible to give Spark a task which contains 10 pictures (for example), and then have Spark split it into smaller tasks (1 task = 1 picture) and send each picture to a worker?
And if it's possible, is this very efficient? I actually heard about Celery and I'm wondering if this kind of solution is better for my case.
Thank you for your help! :)
I think it depends on what you mean by "a lot of pictures" and how often you will get "a lot of pictures" to process. If you have tens of thousands of pictures and you will get them frequently, then Spark will definitely be a good solution.
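Roughly speaking, yes: you can parallelize a list of image references and map your per-picture function over it, so Spark schedules whole pictures, one per task. A minimal sketch (processImage and the path list are placeholders for your own software's entry point; master/deployment settings come from spark-submit):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ImageBatch {

    // Placeholder for invoking your existing per-picture software.
    static String processImage(String imagePath) {
        return imagePath + " -> result";
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("image-batch");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<String> imagePaths = Arrays.asList("/images/a.png", "/images/b.png"); // placeholders

            // One element per picture: Spark distributes elements across executors,
            // so each worker always processes whole pictures, never fragments of one.
            List<String> results = sc.parallelize(imagePaths, imagePaths.size())
                                     .map(ImageBatch::processImage)
                                     .collect();

            results.forEach(System.out::println);
        }
    }
}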
From an architecture and requirements viewpoint I think either Spark or Storm will fit the bill. My main concern would be whether the overhead is justified. This talk, for instance, is about real-time image processing with Spark:
https://www.youtube.com/watch?v=I6qmEcGNgDo
You could also look at this Quora thread:
https://www.quora.com/Has-anyone-developed-computer-vision-image-processing-algorithms-on-Twitter-Storm

Java : Use Disruptor or Not . .

Hi,
Currently I am developing a program that takes 2 values from an AMQ queue and performs a series of mathematical calculations on them. A topic has been created on the AMQ server, to which my program subscribes and receives messages via callbacks (listeners).
Now, whenever a message arrives, the two values are taken out of it and added to a SynchronizedDescriptiveStatistics object. After each addition to the list of values, the whole sequence of calculations is performed all over again (this is actually part of the requirement).
The problem I am facing right now is that, since I am using listeners, sometimes one or more messages are received in the middle of the calculations. Although SynchronizedDescriptiveStatistics takes care of all the thread-related issues itself, it adds all the waiting values to its list of numbers at once when it comes out of its lock, whereas what I need is to add one value, perform the calculations on it, then add the second value, and so on.
The solution I came up with is to use job queues in my program (not AMQ queues). This way, whenever the calculations are over, the program would look for further jobs in the queue and proceed accordingly.
Since I am also looking for efficiency and speed, I thought the Disruptor framework might be a good fit for this problem, as it is optimized for threaded situations. But I am not sure it's worth the trouble of integrating the Disruptor into my application, because a regular standard queue might be enough for what I am trying to do.
Let me also tell you that there is a lot of data on which the calculations need to be performed, it will keep coming in, and the whole set of calculations will need to be performed all over again for each addition of a single value, in a continuous fashion. So, keeping in mind the efficiency requirement and the huge volume of data, what do you think will be useful in the long run?
Waiting for a reply. . .
Regards.
I'll give our typical answer to this question: test first, and make your decision based on your results.
Although you talk about efficiency, you don't specifically say that performance is a fundamental requirement. If you have an idea of your performance requirements, you could mock up a simple prototype using queues versus a basic implementation of the Disruptor, and take measurements of the performance of both.
If one comes off substantially better than the other, that's your answer. If, however, one is much more effort to implement, especially if it's also not giving you the efficiency you require, or you don't have any hard performance requirements, then that suggests that solution is not the right one.
Measure first, and decide based on your results.
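For the plain-queue side of such a prototype, a minimal sketch could look like this (addAndRecompute stands in for adding one value to your SynchronizedDescriptiveStatistics object and re-running all the calculations): the listener only enqueues, and a single consumer thread adds exactly one value and fully recomputes before taking the next.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueBaseline {

    // Bounded in-process job queue; the message listener only enqueues here.
    private static final BlockingQueue<Double> jobs = new ArrayBlockingQueue<>(100_000);

    // Called from the AMQ listener callback: hand the value off and return immediately.
    static void onMessage(double value) throws InterruptedException {
        jobs.put(value);
    }

    // Placeholder for adding one value to the statistics object and re-running the calculations.
    static void addAndRecompute(double value) {
        // stats.addValue(value); runAllCalculations(stats); ...
    }

    public static void main(String[] args) {
        Thread consumer = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    // Exactly one value is taken, added and fully processed before the next one,
                    // no matter how many messages arrived in the meantime.
                    addAndRecompute(jobs.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "calc-consumer");
        consumer.start();
        // ...wire up the AMQ subscription so its listener calls onMessage(...), then let it run.
    }
}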
