I am implementing a parallelized data processing system that involves a bunch of conversions and filters of data as it moves through multiple stages. I recognize the Apache Commons Pipeline project as a good fit for this requirement, but Apache Camel seems to provide a superset of that functionality. Does Camel replace the Commons Pipeline?
Apache Camel's goal is more to be a mediator/routing engine for distributed systems and systems integration. That said, as you noticed, it is very lightweight and could easily serve as an engine for parallelized execution of data flows. I don't think you should see Camel as a replacement, but rather as an alternative.
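For a feel of what such a staged flow looks like without either framework, here is a minimal sketch in plain Python. It is not Commons Pipeline or Camel code, only an illustration of conversion and filter stages applied to items in parallel. The stage functions (parse, is_valid, enrich) and the "key=value" input format are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def parse(raw):
    """Conversion stage (hypothetical): turn 'key=value' text into a (key, int) tuple."""
    key, value = raw.split("=")
    return key.strip(), int(value)


def is_valid(record):
    """Filter stage (hypothetical): keep only non-negative values."""
    return record[1] >= 0


def enrich(record):
    """Conversion stage (hypothetical): add a derived field."""
    key, value = record
    return {"key": key, "value": value, "doubled": value * 2}


def run_pipeline(raw_items, workers=4):
    """Run every item through all stages; different items are processed in parallel."""

    def process(raw):
        record = parse(raw)
        return enrich(record) if is_valid(record) else None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with raw_items.
        results = pool.map(process, raw_items)
        return [r for r in results if r is not None]
```

Threads are used here on the assumption that the stages are I/O-bound. For CPU-bound stages a process pool would be the more likely choice.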
Related
I recently started working on a content repository migration project between two different content management systems.
We have around 11 petabytes of documents in a source repository. We want to migrate all of them one document at a time by querying with the source system's API and saving through the destination system's API.
We will have a single standalone machine for this migration and should be able to manage (start, stop, resume) the whole process.
What platforms and tools would you suggest for such a task? Is Flink's DataSet API for bounded data suitable for this job?
Flink's DataStream API is probably a better choice than the DataSet API because the streaming API can be stopped/resumed and can recover from failures. By contrast, the DataSet API reruns failed jobs from the beginning, which isn't a good fit for a job that might run for days (or weeks).
While Flink's streaming API is designed for unbounded data streams, it also works very well for bounded datasets.
If the underlying CMSes can support doing the migration in parallel, Flink would easily accommodate this. The Async I/O feature would be helpful in that context. But if you are going to do the migration serially, then I'm not sure you'll get much benefit from a framework like Flink or Spark.
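To illustrate the parallel case, here is a rough sketch in plain Python with asyncio. It is not Flink's Async I/O operator, only the same underlying idea: keep a bounded number of requests in flight instead of waiting on each one serially. fetch_document and store_document are hypothetical stand-ins for the two CMS APIs.

```python
import asyncio


async def fetch_document(doc_id):
    """Hypothetical stand-in for a call to the source CMS API."""
    await asyncio.sleep(0)  # simulate network I/O
    return {"id": doc_id, "body": f"content-{doc_id}"}


async def store_document(doc, destination):
    """Hypothetical stand-in for a call to the destination CMS API."""
    await asyncio.sleep(0)  # simulate network I/O
    destination[doc["id"]] = doc["body"]


async def migrate_concurrently(doc_ids, destination, max_in_flight=8):
    """Migrate documents concurrently, capping the number of in-flight requests."""
    limit = asyncio.Semaphore(max_in_flight)

    async def migrate_one(doc_id):
        async with limit:
            doc = await fetch_document(doc_id)
            await store_document(doc, destination)

    await asyncio.gather(*(migrate_one(d) for d in doc_ids))


def run_migration(doc_ids, max_in_flight=8):
    """Convenience wrapper that returns the populated destination store."""
    destination = {}
    asyncio.run(migrate_concurrently(doc_ids, destination, max_in_flight))
    return destination
```

The max_in_flight cap is the knob that matters: it should be set to whatever the slower of the two CMSes can tolerate.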
Basically what David said above. The main challenge I think you'll run into is tracking progress such that checkpointing/savepointing (and thus restarting) works properly.
This assumes you have some reasonably efficient and stable way to enumerate the unique IDs for all 1B documents in the source system. One approach we've used in a previous migration project (though not with Flink) was to use the document creation timestamp as the "event time".
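A minimal sketch of that progress-tracking idea, in plain Python rather than Flink: process documents in (creation timestamp, ID) order and persist the last completed position, so a restart skips everything at or before it. The checkpoint file format and the copy_one callback are assumptions made for the example, not part of either CMS API.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the last committed (created_ts, doc_id) pair, or None on first run."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return tuple(json.load(f))


def save_checkpoint(path, position):
    """Write the checkpoint atomically so a crash cannot leave a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(list(position), f)
    os.replace(tmp, path)


def migrate_resumable(documents, checkpoint_path, copy_one, limit=None):
    """Copy documents in (created_ts, doc_id) order, resuming after the checkpoint.

    `documents` is an iterable of (created_ts, doc_id) pairs and `copy_one` is a
    hypothetical callback that moves one document between the two systems.
    `limit` stops the run early, which stands in for a manual stop.
    """
    last = load_checkpoint(checkpoint_path)
    done = 0
    # Sorted in memory for brevity only; a real job would ask the source
    # system to enumerate documents in this order.
    for position in sorted(tuple(d) for d in documents):
        if last is not None and position <= last:
            continue  # already migrated in an earlier run
        if limit is not None and done >= limit:
            break
        copy_one(position[1])
        save_checkpoint(checkpoint_path, position)
        done += 1
    return done
```

Note that a crash between copy_one and save_checkpoint means one document is copied again on restart, so the write to the destination needs to be idempotent. The document ID breaks ties between documents that share a timestamp.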
I have some extra security constraints compared with a normal job. I usually build with sbt and give it some libraries to fetch from a Maven repository. But now I'm unable to use a lot of external libraries, and I'm unsure at this point whether I will be able to go out to Maven to get the Spark libraries that I might need. Even if I were to get the external libraries, there would be a vetting process that would take months for each library. Has anyone been in a similar situation? Given that external libraries are largely off the table, can anyone share what they did to build a successful suite of Spark jobs for their data munging and data science on a Hadoop cluster?
I don't think there is a standard solution for your problem in the context you described. It depends on how heavily you rely on external dependencies and what you really need. Here is an example: parsing CSV rows and constructing DataFrames/Datasets or RDDs. You have plenty of options:
use an external library (from Databricks or others)
rely on your own code and do it by hand, so no external dependency
rely on newer Spark versions, which know how to deal with CSV
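To give an idea of the second option, here is a small sketch of a per-line parser that uses nothing outside the standard library. It is written in Python for brevity, and the column names and converters are made up. A function like this is the kind of thing you would map over the lines of an input file inside a job.

```python
import csv


def parse_line(line, columns, types):
    """Parse one CSV line into a dict, handling quoted fields and embedded commas.

    `columns` and `types` are parallel lists: a column name and the converter
    (int, float, str, ...) to apply to that column's raw text.
    Quoted fields that span several lines are not supported by this sketch.
    """
    values = next(csv.reader([line]))
    if len(values) != len(columns):
        raise ValueError(
            f"expected {len(columns)} fields, got {len(values)}: {line!r}"
        )
    return {
        name: convert(value)
        for name, convert, value in zip(columns, types, values)
    }
```

For example, parse_line('42,"Doe, Jane",3.5', ["id", "name", "score"], [int, str, float]) gives a dict with an int id, the name kept intact despite the comma, and a float score.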
If you have a Hadoop cluster, then the Spark runtime environment already contains plenty of libraries that will be loaded (JSON manipulation, networking, logging, just to name a few). Most of the business logic inside your Spark jobs can be done with those.
Here are some examples of how I have approached the problem of external dependencies, although I didn't have any security constraints. In one case we had to use a Spring dependency within our Spark application (because we wanted to update some relation tables), so we ended up with a fat jar containing all the Spring dependencies, and there were many. Conclusion: we pulled in a lot of dependencies for nothing (a horror to maintain :) ). So that was not a good approach. In another case we had to do the same thing, but we kept the dependencies to a minimum (the simplest thing that can read/update a table over JDBC). Conclusion: the fat jar was not that big, and we kept only what was really needed, nothing more, nothing less.
Spark already provides you with a lot of functionality. Knowing of an external library that can do something does not mean that Spark can't do it with what it has.
Apache Beam supports multiple runner backends, including Apache Spark and Flink. I'm familiar with Spark/Flink and I'm trying to see the pros/cons of Beam for batch processing.
Looking at the Beam word count example, it feels it is very similar to the native Spark/Flink equivalents, maybe with a slightly more verbose syntax.
I currently don't see a big benefit of choosing Beam over Spark/Flink for such a task. The only observations I can make so far:
Pro: Abstraction over different execution backends.
Con: This abstraction comes at the price of having less control over what exactly is executed in Spark/Flink.
Are there better examples that highlight other pros/cons of the Beam model? Is there any information on how the loss of control affects performance?
Note that I'm not asking for differences in the streaming aspects, which are partly covered in this question and summarized in this article (outdated due to Spark 1.X).
There are a few things that Beam adds over many of the existing engines.
Unifying batch and streaming. Many systems can handle both batch and streaming, but they often do so via separate APIs. But in Beam, batch and streaming are just two points on a spectrum of latency, completeness, and cost. There's no learning/rewriting cliff from batch to streaming. So if you write a batch pipeline today but tomorrow your latency needs change, it's incredibly easy to adjust. You can see this kind of journey in the Mobile Gaming examples.
APIs that raise the level of abstraction: Beam's APIs focus on capturing properties of your data and your logic, instead of letting details of the underlying runtime leak through. This is both key for portability (see next paragraph) and can also give runtimes a lot of flexibility in how they execute. Something like ParDo fusion (aka function composition) is a pretty basic optimization that the vast majority of runners already do. Other optimizations are still being implemented for some runners. For example, Beam's Source APIs are specifically built to avoid overspecifying the sharding within a pipeline. Instead, they give runners the right hooks to dynamically rebalance work across available machines. This can make a huge difference in performance by essentially eliminating straggler shards. In general, the more smarts we can build into the runners, the better off we'll be. Even the most careful hand tuning will fail as data, code, and environments shift.
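The fusion idea mentioned above can be illustrated with a toy sketch in plain Python. This is not Beam code and not how any runner implements it. It only shows why composing consecutive element-wise steps into one function avoids materializing the intermediate collections. It also only covers one-to-one steps, whereas a real ParDo may emit zero or many outputs per element.

```python
from functools import reduce


def fuse(*steps):
    """Compose element-wise steps into one function applied in a single pass."""

    def fused(element):
        return reduce(lambda value, step: step(value), steps, element)

    return fused


def run_unfused(elements, steps):
    """Apply each step as its own pass, materializing every intermediate list."""
    for step in steps:
        elements = [step(e) for e in elements]
    return elements


def run_fused(elements, steps):
    """Apply all steps to each element in one pass, with no intermediate lists."""
    fused = fuse(*steps)
    return [fused(e) for e in elements]
```

Both functions return the same result for the same steps. The difference is that the fused version touches each element once and never builds the in-between lists, which is where the saving comes from when the "lists" are distributed collections.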
Portability across runtimes: Because data shapes and runtime requirements are neatly separated, the same pipeline can be run in multiple ways. And that means that you don't end up rewriting code when you have to move from on-prem to the cloud or from a tried and true system to something on the cutting edge. You can very easily compare options to find the mix of environment and performance that works best for your current needs. And that might be a mix of things -- processing sensitive data on premise with an open source runner and processing other data on a managed service in the cloud.
Designing the Beam model to be a useful abstraction over many different engines is tricky. Beam is neither the intersection of the functionality of all the engines (too limited!) nor the union (too much of a kitchen sink!). Instead, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines.
Keyed State is a great example of functionality that existed in various engines and enabled interesting and common use cases, but wasn't originally expressible in Beam. We recently expanded the Beam model to include a version of this functionality according to Beam's design principles.
And vice versa, we hope that Beam will influence the roadmaps of various engines as well. For example, the semantics of Flink's DataStreams were influenced by the Beam (née Dataflow) model.
This also means that the capabilities will not always be exactly the same across different Beam runners at a given point in time. That's why we're using a capability matrix to try to clearly communicate the state of things.
I have a disadvantage to report, not a benefit. We had a leaky-abstraction problem with Beam: when an issue needs to be debugged, we have to learn about the underlying runner and its API, Flink in this case, to understand the issue. This doubles the learning curve, because you have to learn Beam and Flink at the same time. We ended up switching the pipelines we developed later to Flink.
Helpful information can be found here - https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html
---Quoted---
Beam provides a unified API for both batch and streaming scenarios.
Beam comes with native support for different programming languages, like Python or Go with all their libraries like Numpy, Pandas, Tensorflow, or TFX.
You get the power of Apache Flink like its exactly-once semantics, strong memory management and robustness.
Beam programs run on your existing Flink infrastructure or infrastructure for other supported Runners, like Spark or Google Cloud Dataflow.
You get additional features like side inputs and cross-language pipelines that are not supported natively in Flink but only supported when using Beam with Flink
Can someone give me some examples of Java APIs that are yet to be implemented in Apache Spark? I am trying to see whether there are any Scala Spark APIs that "do not exist/have limited functionality" if I decide to use the Java APIs instead.
That would be a deal-breaker for me.
Disclaimer:
Based on my googling/analysis, I realize that Scala community support for Apache Spark is really good. Also, I understand that in order to work efficiently with Spark you need to learn some Scala anyway (as the source code is in Scala).
Optimistic point of view:
Consider that:
The standard Scala backend is a Java VM. Scala classes are Java classes, and vice versa. You can call the methods of either language from methods in the other one. You can extend Java classes in Scala, and vice versa. The main limitation is that some Scala features do not have equivalents in Java, for example traits.
Conclusion - there is no missing API
Pessimistic point of view:
Spark is written in Scala, has a Scala-centric API, and is not Java friendly. There are multiple packages (like GraphX) which have no Java-friendly API. You need code like this once in a while.
I am trying to set up a Kafka-Spark environment. I am trying to debug my MapReduce logic (implemented in Java). The spark-submit step makes it complicated to debug my algorithms with breakpoints. Incoming live data patterns are complex. It would be a very time-consuming process to simulate the complex algorithms. A better development environment would help developers validate their MapReduce logic on live stream data.
Please suggest any tips and tricks. Is it possible to use IDE breakpoints or remote debugging with Apache Spark?
I don't think it matters whether you are developing a streaming or a batch Spark application. You can always use IntelliJ IDEA for graphical debugging of your application.
Also look at this video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ. In the end, if you want to debug how your application reacts to data patterns, I suggest using Spark's internal tools to see, for example, how the DAG is being created or how it's being distributed.
One thing I myself am working on is to use Spark's debugging tools and build a log that follows my application's execution graph, with added information from profilers (the usual OS tools like iotop or iostat), to find, for example, where I am not utilizing resources enough.
In the end you need all this information together to make a decision, and ironically that can itself become a data-intensive application.