I have some extra security considerations compared to a normal job. I usually build with sbt, giving it some libraries to grab from a Maven repository. But now I'm unable to use a lot of external libraries, and at this point I'm unsure whether I'll even be able to go out to Maven to get the Spark libraries I might need. Even if I could get the external libraries, there would be a vetting process that would take months for each library. Has anyone been in a similar situation? Given that external libraries are off the table, can anyone share what they did to build a successful suite of Spark jobs for their data munging and data science on a Hadoop cluster?
I don't think there is a standard solution for your problem within the context you've described. It depends on how far you go with external dependencies and what you really need. Take one example: parsing CSV rows and constructing DataFrames/Datasets or RDDs. You have plenty of options:
use an external library (from Databricks or others);
rely on your own code and do it by hand, so no external dependency;
rely on newer Spark versions, which know how to deal with CSV out of the box.
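On that last point: since Spark 2.0 the CSV reader is built in, so nothing like the old Databricks spark-csv artifact is needed. A minimal sketch (the input path is a placeholder):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvWithoutExternalDeps {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-no-deps")
                .getOrCreate();

        // Built-in CSV support; no external parser library required.
        Dataset<Row> df = spark.read()
                .option("header", "true")       // first line holds column names
                .option("inferSchema", "true")  // let Spark guess column types
                .csv("/data/input.csv");        // placeholder path

        df.show();
        spark.stop();
    }
}
```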
If you have a Hadoop cluster, then the Spark runtime environment already contains plenty of libraries that will be loaded (JSON manipulation, networking, logging, just to name a few). Most of the business logic inside your Spark jobs can be done with those.
I'll give you some examples of how I have approached the problem of external dependencies, although I didn't have any security constraints. In one case we had to use a Spring dependency within our Spark application (because we wanted to update some relational tables), so we built a fat jar with all the Spring dependencies, and there were many. Conclusion: a lot of dependencies for nothing (a horror to maintain :) ). So that was not a good approach. In another case we had to do the same thing, but we kept the dependencies to a minimum (the simplest thing that can read/update a table over JDBC). Conclusion: the fat jar was not that big, and we kept only what was really needed, nothing more and nothing less.
Spark already provides a lot of functionality. Knowing an external library that can do something does not mean that Spark can't do it with what it has.
I recently started working on a content repository migration project between two different content management systems.
We have around 11 petabytes of documents in the source repository. We want to migrate all of them, one document at a time, by querying through the source system's API and saving through the destination system's API.
We will have a single standalone machine for this migration and need to be able to manage (start, stop, resume) the whole process.
What platforms and tools would you suggest for such a task? Is Flink's DataSet API for bounded data suitable for this job?
Flink's DataStream API is probably a better choice than the DataSet API because the streaming API can be stopped/resumed and can recover from failures. By contrast, the DataSet API reruns failed jobs from the beginning, which isn't a good fit for a job that might run for days (or weeks).
While Flink's streaming API is designed for unbounded data streams, it also works very well for bounded datasets.
If the underlying CMSes can support doing the migration in parallel, Flink would easily accommodate this. The Async I/O feature would be helpful in that context. But if you are going to do the migration serially, then I'm not sure you'll get much benefit from a framework like Flink or Spark.
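To make the parallel variant concrete, here is a rough sketch using Flink's Async I/O. `SourceCms` and `DestCms` are hypothetical stand-ins for the two systems' client APIs, and the document IDs are hard-coded for illustration:

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class CmsMigrationJob {

    // Hypothetical CMS clients; the real ones would come from the two vendors' SDKs.
    static class SourceCms { static String fetch(String id) { return "payload-of-" + id; } }
    static class DestCms   { static String save(String doc) { return "saved:" + doc; } }

    // Fetches a document from the source CMS and writes it to the destination CMS,
    // without blocking a task slot while the I/O is in flight.
    static class MigrateDoc extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String docId, ResultFuture<String> resultFuture) {
            CompletableFuture
                .supplyAsync(() -> DestCms.save(SourceCms.fetch(docId)))
                .thenAccept(saved -> resultFuture.complete(Collections.singleton(saved)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A bounded stream of document IDs enumerated from the source system.
        DataStream<String> ids = env.fromElements("doc-1", "doc-2", "doc-3");

        // Up to 100 migrations in flight at once, each timing out after 60 seconds.
        AsyncDataStream
            .unorderedWait(ids, new MigrateDoc(), 60, TimeUnit.SECONDS, 100)
            .print();

        env.execute("cms-migration");
    }
}
```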
Basically what David said above. The main challenge I think you'll run into is tracking progress such that checkpointing/savepointing (and thus restarting) works properly.
This assumes you have some reasonably efficient and stable way to enumerate the unique IDs for all 1B documents in the source system. One approach we've used in a previous migration project (though not with Flink) was to use the document creation timestamp as the "event time".
I've read in a few places that the MapStore and MapLoader must run with the Hazelcast node. I'd like to find out whether there is any way to implement a MapStore/MapLoader separately from the Hazelcast node.
Background:
If I have a Hazelcast cluster for the team, and this cluster is to be used by different sub-teams, each providing a different map as data, and each sub-team should implement the MapStore/MapLoader for the map they own, how can this be done? (Note that each sub-team has its own SVN repository.)
Thanks in advance~
MapLoader's load() operation is only invoked on the node that owns the key, and only when that key is missing, so there is no way to push this processing elsewhere.
However, each map can have a different MapStore/MapLoader implementation, so having a different team provide each is certainly feasible.
Exactly how you achieve this comes down to your build and deploy practices. For example, each team's classes could be in a separate jar file on the classpath. Or, there could be a single jar file constructed containing the classes provided by each team. Many ways exist!
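For example, wiring a different implementation to each map is just configuration; the class names below are hypothetical placeholders for classes each team would ship in its own jar:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class PerTeamMapStores {
    public static void main(String[] args) {
        Config config = new Config();

        // Team A owns the "orders" map and its MapStore implementation.
        MapConfig orders = new MapConfig("orders");
        orders.setMapStoreConfig(new MapStoreConfig()
                .setEnabled(true)
                .setClassName("com.teama.OrdersMapStore"));    // hypothetical class

        // Team B owns the "customers" map and its MapStore implementation.
        MapConfig customers = new MapConfig("customers");
        customers.setMapStoreConfig(new MapStoreConfig()
                .setEnabled(true)
                .setClassName("com.teamb.CustomersMapStore")); // hypothetical class

        config.addMapConfig(orders);
        config.addMapConfig(customers);

        // Both teams' jars must be on this member's classpath for the classes to load.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}
```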
I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open-source framework or cloud solution I can use. I took a look at Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint as to which frameworks you would use for such a task?
From my quick read on Apex, it requires Hadoop underneath, coupling you to more dependencies than you probably want early on.
Kafka, on the other hand, is used for transmitting messages (it has other APIs, such as Streams and Connect, which I'm not as familiar with).
I'm currently using Kafka to stream log files in real time from a client system. Out of the box, Kafka really only provides fire-and-forget semantics; I have had to add a bit on top to get exactly-once delivery (Kafka 0.11.0 should solve this).
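For what it's worth, the 0.11.0 fix is the idempotent producer, which is a single config flag; a minimal sketch (broker address, topic, key, and value are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IdempotentLogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Kafka 0.11+: broker-side deduplication of producer retries, removing
        // the duplicates that at-least-once delivery would otherwise create.
        props.put("enable.idempotence", "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("logs", "host-1", "a log line"));
        }
    }
}
```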
Overall, think of Kafka as a lower-level solution built around logical message domains with queues, while from what I skimmed, Apex is a more heavily packaged library with a lot more things to explore.
Kafka would also allow you to swap out the underlying analytical system of your choosing via its consumer API.
The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which can quickly build up in the cloud. The size of the data is also important, of course.
These are a few things you should consider:
1) Batch vs. streaming: do the updates flow continuously, or is the process run on demand/periodically? (It sounds like the latter rather than the former.)
2) What latency is required? That is, what is the maximum time an update should take to propagate through the system? The answer to this influences question 1).
3) How much data are we talking about? Are you at gigabyte, terabyte, or petabyte scale? Different tools have different 'maximum altitudes'.
4) And in what format? Do you have text files, or are you pulling from relational DBs?
5) Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning to use for that part? Depending on question 3) (data size), deduping usually requires a join by ID, which is done in constant time in a key-value store but requires a sort (generally O(n log n)) in most other data systems (Spark, Hadoop, etc.); see the sketch after this list.
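To illustrate point 5), here is a minimal hash-based dedup sketch, assuming records carry a simple string ID (all names here are made up):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupById {
    record Customer(String id, String name) {}

    // Keep the first record seen for each ID: one hash lookup per record,
    // i.e. O(n) overall, versus the O(n log n) of a sort-based dedup.
    static List<Customer> dedup(List<Customer> customers) {
        Set<String> seen = new HashSet<>();
        List<Customer> out = new ArrayList<>();
        for (Customer c : customers) {
            if (seen.add(c.id())) {   // add() returns false for an already-seen ID
                out.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Customer> raw = List.of(
                new Customer("42", "Ada"),
                new Customer("42", "Ada L."),   // duplicate ID, dropped
                new Customer("7", "Grace"));
        System.out.println(dedup(raw));         // keeps 42/Ada and 7/Grace
    }
}
```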
So, while you ponder all these questions, if you're not sure, I'd recommend you start your cloud work with an elastic, pay-as-you-go solution rather than setting up entire clusters in the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is Amazon Athena (https://aws.amazon.com/athena/). You dump your data in S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Presto, so you could write the whole system using basically SQL.
Otherwise you could use Elastic MapReduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html) or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. There are also similar products from Google (BigQuery, etc.) and Microsoft (Azure).
Yes, you can use Apache Apex for your use case. Apex is complemented by Apache Apex Malhar, an operator library that can help you build applications quickly: load data using the JDBC input operator and then either store it to your cloud storage (maybe S3) or de-duplicate it before storing it to any sink. Malhar also provides a Dedup operator for exactly this kind of operation. But, as mentioned in the previous reply, Apex does need Hadoop underneath to function.
Apache Beam supports multiple runner backends, including Apache Spark and Flink. I'm familiar with Spark/Flink and I'm trying to see the pros/cons of Beam for batch processing.
Looking at the Beam word count example, it feels very similar to the native Spark/Flink equivalents, maybe with a slightly more verbose syntax.
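For reference, a minimal Beam word count in Java looks roughly like this (file paths are placeholders; the runner, e.g. `--runner=FlinkRunner`, is chosen via pipeline options rather than in the code):

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamWordCount {
    public static void main(String[] args) {
        // The execution backend (Spark, Flink, Dataflow, ...) comes from the args.
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(TextIO.read().from("input.txt"))                       // placeholder path
         .apply(FlatMapElements.into(TypeDescriptors.strings())
                 .via(line -> Arrays.asList(line.split("\\W+"))))      // split into words
         .apply(Filter.by((String word) -> !word.isEmpty()))
         .apply(Count.perElement())                                    // word -> count
         .apply(MapElements.into(TypeDescriptors.strings())
                 .via(kv -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("counts"));                          // placeholder path

        p.run().waitUntilFinish();
    }
}
```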
I currently don't see a big benefit of choosing Beam over Spark/Flink for such a task. The only observations I can make so far:
Pro: Abstraction over different execution backends.
Con: This abstraction comes at the price of having less control over what exactly is executed in Spark/Flink.
Are there better examples that highlight other pros/cons of the Beam model? Is there any information on how the loss of control affects performance?
Note that I'm not asking for differences in the streaming aspects, which are partly covered in this question and summarized in this article (outdated due to Spark 1.X).
There are a few things that Beam adds over many of the existing engines.
Unifying batch and streaming. Many systems can handle both batch and streaming, but they often do so via separate APIs. But in Beam, batch and streaming are just two points on a spectrum of latency, completeness, and cost. There's no learning/rewriting cliff from batch to streaming. So if you write a batch pipeline today but tomorrow your latency needs change, it's incredibly easy to adjust. You can see this kind of journey in the Mobile Gaming examples.
APIs that raise the level of abstraction: Beam's APIs focus on capturing properties of your data and your logic, instead of letting details of the underlying runtime leak through. This is key for portability (see the next paragraph) and can also give runtimes a lot of flexibility in how they execute. Something like ParDo fusion (aka function composition) is a pretty basic optimization that the vast majority of runners already do. Other optimizations are still being implemented for some runners. For example, Beam's Source APIs are specifically built to avoid overspecifying the sharding within a pipeline. Instead, they give runners the right hooks to dynamically rebalance work across available machines. This can make a huge difference in performance by essentially eliminating straggler shards. In general, the more smarts we can build into the runners, the better off we'll be. Even the most careful hand-tuning will fail as data, code, and environments shift.
Portability across runtimes: because data shapes and runtime requirements are neatly separated, the same pipeline can be run in multiple ways. And that means that you don't end up rewriting code when you have to move from on-prem to the cloud, or from a tried-and-true system to something on the cutting edge. You can very easily compare options to find the mix of environment and performance that works best for your current needs. And that might be a mix of things -- processing sensitive data on premises with an open source runner, and processing other data on a managed service in the cloud.
Designing the Beam model to be a useful abstraction over many, different engines is tricky. Beam is neither the intersection of the functionality of all the engines (too limited!) nor the union (too much of a kitchen sink!). Instead, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines.
Keyed State is a great example of functionality that existed in various engines and enabled interesting and common use cases, but wasn't originally expressible in Beam. We recently expanded the Beam model to include a version of this functionality according to Beam's design principles.
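As a rough sketch of what that functionality looks like in the Java SDK (a per-key counter; names are illustrative):

```java
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// Emits a running count per key, using Beam's keyed state.
class CountPerKeyFn extends DoFn<KV<String, String>, KV<String, Long>> {

    // Declares a piece of state, partitioned by key and window.
    @StateId("count")
    private final StateSpec<ValueState<Long>> countSpec = StateSpecs.value();

    @ProcessElement
    public void process(ProcessContext ctx,
                        @StateId("count") ValueState<Long> count) {
        Long current = count.read();                      // null on a key's first element
        long next = (current == null ? 0L : current) + 1;
        count.write(next);
        ctx.output(KV.of(ctx.element().getKey(), next));
    }
}
```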
And vice versa, we hope that Beam will influence the roadmaps of various engines as well. For example, the semantics of Flink's DataStreams were influenced by the Beam (née Dataflow) model.
This also means that capabilities will not always be exactly the same across different Beam runners at a given point in time. That's why we use the capability matrix to try to communicate the state of things clearly.
I have a disadvantage to report, not a benefit. We had a leaky-abstraction problem with Beam: whenever an issue needed to be debugged, we had to learn about the underlying runner and its API (Flink, in our case) to understand it. This doubles the learning curve, since you have to learn Beam and Flink at the same time. We ended up switching the pipelines we developed later to native Flink.
Helpful information can be found here - https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html
---Quoted---
Beam provides a unified API for both batch and streaming scenarios.
Beam comes with native support for different programming languages, like Python or Go with all their libraries like Numpy, Pandas, Tensorflow, or TFX.
You get the power of Apache Flink like its exactly-once semantics, strong memory management and robustness.
Beam programs run on your existing Flink infrastructure or infrastructure for other supported Runners, like Spark or Google Cloud Dataflow.
You get additional features like side inputs and cross-language pipelines that are not supported natively in Flink but only supported when using Beam with Flink
From Puppet Best Practices:
The Puppet Labs documentation describes modules as self-contained bundles of code and data.
OK, that's clear.
A single module can easily manage a single application.
So, puppetlabs-apache manages Apache only, puppetlabs-mysql manages MySQL only.
.... So my module my_company-mediawiki manages MediaWiki only (I suppose... with database and virtual host... because a module is a self-contained bundle of code and data).
Modules are most effective when they serve a single purpose, limit dependencies, and concern themselves only with managing system state relating to their named purpose.
But my_company-mediawiki needs to depend on:
puppetlabs-mysql: to create the database;
puppetlabs-apache: to manage a virtual host.
And... from a quick search I understand that many modules do refer to other modules.
But...
They provide complete functionality without creating dependencies on any other modules, and can be combined as needed to build different application stacks.
OK, so a good module is self-contained and has no dependencies.
So do I necessarily have to use the roles and profiles pattern to satisfy these best practices? Or am I confused...
The Puppet documentation's description of modules as self-contained is more aspirational than definitive. Don't read too much into it, or into others' echoes of it. Modules are quite simply Puppet's next level of code organization above classes and defined types, incorporating also plug-ins and owned data.
Plenty of low-level modules indeed have no cross-module dependencies, but such dependencies inescapably arise when you start forming aggregations at a level between that and whole node configurations. There is nothing inherently wrong with that. The Roles & Profiles pattern is a good way to structure such aggregations, but it is not the only way, and in any case it does not avoid cross-module dependencies because role and profile classes, like any other, should themselves belong to modules.
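To make the pattern concrete, here is a minimal, hypothetical roles-and-profiles sketch in Puppet (module layout, names, and parameters are illustrative; it leans on puppetlabs-apache and puppetlabs-mysql):

```puppet
# site/profiles/manifests/mediawiki.pp
# A profile wraps component modules behind one opinionated class.
class profiles::mediawiki {
  class { 'apache':
    default_vhost => false,
  }
  apache::vhost { 'wiki.example.com':
    port    => 80,
    docroot => '/var/www/mediawiki',
  }
  class { 'mysql::server': }
  mysql::db { 'mediawiki':
    user     => 'wiki',
    password => 'changeme',   # placeholder; use Hiera/eyaml in practice
  }
}

# site/roles/manifests/wiki_server.pp
# A role is just the set of profiles that defines one kind of node.
class roles::wiki_server {
  include profiles::mediawiki
}
```

The cross-module dependencies live in the profile, while roles stay dependency-free compositions of profiles.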