I recently started working on a content repository migration project between two different content management systems.
We have around 11 petabytes of documents in the source repository. We want to migrate all of them, one document at a time, by querying the source system's API and saving through the destination system's API.
We will have a single standalone machine for this migration and should be able to manage (start, stop, resume) the whole process.
What platforms and tools would you suggest for such a task? Is Flink's DataSet API for bounded data suitable for this job?
Flink's DataStream API is probably a better choice than the DataSet API because the streaming API can be stopped/resumed and can recover from failures. By contrast, the DataSet API reruns failed jobs from the beginning, which isn't a good fit for a job that might run for days (or weeks).
While Flink's streaming API is designed for unbounded data streams, it also works very well for bounded datasets.
If the underlying CMSes can support doing the migration in parallel, Flink would easily accommodate this. The Async I/O feature would be helpful in that context. But if you are going to do the migration serially, then I'm not sure you'll get much benefit from a framework like Flink or Spark.
Basically what David said above. The main challenge I think you'll run into is tracking progress such that checkpointing/savepointing (and thus restarting) works properly.
This assumes you have some reasonably efficient and stable way to enumerate the unique IDs for all 1B documents in the source system. One approach we've used in a previous migration project (though not with Flink) was to use the document creation timestamp as the "event time".
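To make the Async I/O suggestion above concrete, here is a minimal, hedged Java sketch, not a definitive implementation: a bounded stream of document IDs (enumerated from the source system) is copied asynchronously, with hypothetical fetchFromSource/saveToDestination helpers standing in for the two CMS APIs. Checkpointing is what gives you stop/resume via savepoints.

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class CmsMigrationJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000); // checkpoints/savepoints allow stop + resume

        // Bounded stream of document IDs; in reality this would come from enumerating the source CMS.
        DataStream<String> docIds = env.fromElements("doc-1", "doc-2", "doc-3");

        DataStream<String> migrated = AsyncDataStream.unorderedWait(
                docIds,
                new RichAsyncFunction<String, String>() {
                    @Override
                    public void asyncInvoke(String docId, ResultFuture<String> resultFuture) {
                        CompletableFuture
                                .supplyAsync(() -> fetchFromSource(docId))            // placeholder source API call
                                .thenApply(CmsMigrationJob::saveToDestination)        // placeholder destination API call
                                .thenAccept(result -> resultFuture.complete(Collections.singleton(result)));
                    }
                },
                30, TimeUnit.SECONDS, // per-document timeout
                100);                 // max concurrent in-flight requests

        migrated.print();
        env.execute("cms-migration");
    }

    // Stand-ins for the real source/destination CMS clients.
    private static String fetchFromSource(String docId) { return "contents-of-" + docId; }
    private static String saveToDestination(String doc) { return "saved:" + doc; }
}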
I have been reading a lot of Spring Cloud Data Flow and related documentation in order to produce a data ingestion solution that will run in my organization's Cloud Foundry deployment. The goal is to poll an HTTP service for data, perhaps three times per day for the sake of discussion, and insert/update that data in a PostgreSQL database. The HTTP service seems to provide tens of thousands of records per day.
One point of confusion thus far is a best practice in the context of a DataFlow pipeline for deduplicating polled records. The source data do not have a timestamp field to aid in tracking polling, only a coarse day-level date field. I also have no guarantee that records are not ever updated retroactively. The records appear to have a unique ID, so I can dedup the records that way, but I am just not sure based on the documentation how best to implement that logic in DataFlow. As far as I can tell, the Spring Cloud Stream starters do not provide for this out-of-the-box. I was reading about Spring Integration's smart polling, but I'm not sure that's meant to address my concern either.
My intuition is to create a custom Processor Java component in a DataFlow Stream that performs a database query to determine whether polled records have already been inserted, then inserts the appropriate records into the target database, or passes them on down the stream. Is querying the target database in an intermediate step acceptable in a Stream app? Alternatively, I could implement this all in a Spring Cloud Task as a batch operation which triggers based on some schedule.
What is the best way to proceed with respect to a DataFlow app? What are common/best practices for achieving deduplication as I described above in a DataFlow/Stream/Task/Integration app? Should I copy the setup of a starter app or just start from scratch, because I am fairly certain I'll need to write custom code? Do I even need Spring Cloud DataFlow, because I'm not sure I'll be using its DSL at all? Apologies for all the questions, but being new to Cloud Foundry and all these Spring projects, it's daunting to piece it all together.
Thanks in advance for any help.
You are on the right track; given your requirements, you will most likely need to create a custom processor. You need to keep track of what has been inserted in order to avoid duplication.
There's nothing preventing you from writing such a processor in a stream app; however, performance may take a hit, since you will issue a DB query for each record.
If order is not important, you could parallelize the queries and process several messages concurrently, but in the end your DB would still pay the price.
Another approach would be to use a Bloom filter, which can help quite a lot in speeding up the check for already-inserted records.
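As an illustration of the Bloom filter idea, here is a small sketch using Guava's BloomFilter; the sizing numbers and the existsInDatabase stand-in are assumptions for illustration, not part of any starter app.

import java.nio.charset.StandardCharsets;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomFilterDedup {

    // Sized for ~1M IDs with a 1% false-positive rate; tune for your volume.
    private final BloomFilter<CharSequence> seenIds = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    public boolean isNew(String recordId) {
        if (!seenIds.mightContain(recordId)) {
            seenIds.put(recordId);
            return true; // definitely not seen before; no DB round-trip needed
        }
        // Possible false positive: fall back to the authoritative DB check.
        boolean existsInDb = existsInDatabase(recordId);
        if (!existsInDb) {
            seenIds.put(recordId);
        }
        return !existsInDb;
    }

    private boolean existsInDatabase(String recordId) {
        return false; // stand-in for a real SELECT against the target table
    }
}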
You can start by cloning the starter apps: you could have a poller trigger an HTTP client processor that fetches your data, then goes through your custom processor and finally to a jdbc sink. Something like: stream create time --trigger.cron=<CRON_EXPRESSION> | httpclient --httpclient.url-expression=<remote_endpoint> | customProcessor | jdbc
One of the advantages of using SCDF is that you could independently scale your custom processor via deployment properties such as deployer.customProcessor.count=8
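A hedged sketch of what such a custom processor might look like with the annotation-based Spring Cloud Stream programming model; the ingested_records table, the record_id column, and the Map-shaped payload with an "id" key are assumptions for illustration only.

import java.util.Map;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.cloud.stream.messaging.Processor;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.messaging.handler.annotation.SendTo;

@EnableBinding(Processor.class)
public class DedupProcessor {

    @Autowired
    private JdbcTemplate jdbc;

    @StreamListener(Processor.INPUT)
    @SendTo(Processor.OUTPUT)
    public Map<String, Object> dedupe(Map<String, Object> record) {
        Object id = record.get("id");
        Integer seen = jdbc.queryForObject(
                "SELECT COUNT(*) FROM ingested_records WHERE record_id = ?",
                Integer.class, id);
        if (seen != null && seen > 0) {
            return null; // returning no result produces no output message, so duplicates never reach the jdbc sink
        }
        jdbc.update("INSERT INTO ingested_records (record_id) VALUES (?)", id);
        return record; // new records continue down the stream
    }
}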
Spring Cloud Data Flow builds integration streams for data on top of Spring Cloud Stream, which, in turn, is fully based on Spring Integration. All the principles that exist in Spring Integration can therefore be applied at the SCDF level as well.
It really might be the case that you won't be able to avoid some coding, but what you need is called an Idempotent Receiver in EIP terms, and Spring Integration provides an implementation for us:
@ServiceActivator(inputChannel = "processChannel")
@IdempotentReceiver("idempotentReceiverInterceptor")
public void handle(Message<?> message) {
    // only messages that pass the idempotent receiver interceptor end up here
}
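For completeness, here is a hedged sketch of how the referenced idempotentReceiverInterceptor bean could be wired up; using the "id" header as the dedup key and an in-memory SimpleMetadataStore are assumptions, and a persistent ConcurrentMetadataStore would be the sensible choice in production.

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.channel.NullChannel;
import org.springframework.integration.handler.advice.IdempotentReceiverInterceptor;
import org.springframework.integration.metadata.ConcurrentMetadataStore;
import org.springframework.integration.metadata.SimpleMetadataStore;
import org.springframework.integration.selector.MetadataStoreSelector;

@Configuration
public class IdempotencyConfig {

    @Bean
    public ConcurrentMetadataStore metadataStore() {
        // In-memory store for illustration; use a persistent store (e.g. JDBC- or Redis-backed) in production.
        return new SimpleMetadataStore();
    }

    @Bean
    public IdempotentReceiverInterceptor idempotentReceiverInterceptor(ConcurrentMetadataStore metadataStore) {
        MetadataStoreSelector selector = new MetadataStoreSelector(
                message -> message.getHeaders().get("id", String.class), // assumed dedup key
                metadataStore);
        IdempotentReceiverInterceptor interceptor = new IdempotentReceiverInterceptor(selector);
        interceptor.setDiscardChannel(new NullChannel()); // silently drop duplicates
        return interceptor;
    }
}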
I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open source framework or cloud solution I can use for this. I took a look into Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint as to which frameworks you would use for such a task?
From my quick read on Apex, it requires Hadoop underneath, which couples you to more dependencies than you probably want early on.
Kafka, on the other hand, is used for transmitting messages (it has other APIs such as Streams and Connect, which I'm not as familiar with).
I'm currently using Kafka to stream log files in real time from a client system. Out of the box, Kafka really only provides fire-and-forget semantics; I had to add a bit on top to get exactly-once delivery semantics (Kafka 0.11.0 should solve this).
Overall, think of Kafka as a lower-level solution built around logical message domains with queues, and, from what I skimmed, Apex as a more heavily packaged library with a lot more things to explore.
Kafka would allow you to switch out the underlying analytical system of your choosing via its consumer API.
The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which can quickly build up on the cloud. Of course, the size of the data is also important.
These are a few things you should consider:
batch vs streaming: do the updates flow continuously, or is the process run on demand/periodically? (It sounds like the latter rather than the former.)
what latency is required? That is, what is the maximum time it should take an update to propagate through the system? The answer to this question influences question 1)
how much data are we talking about? Are you in the gigabyte, terabyte, or petabyte range? Different tools have a different 'maximum altitude'
and what format? Do you have text files, or are you pulling from relational DBs?
Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning to use for that part? Depending on question 3) (data size), deduping usually requires a join by ID, which is done in constant time in a key-value store but requires a sort (generally O(n log n)) in most other data systems (Spark, Hadoop, etc.)
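As a rough illustration of that dedup-by-ID step, here is a minimal Spark (Java) sketch; the S3 paths and the customer_id column are placeholders, and the dropDuplicates call is what triggers the shuffle/sort cost mentioned above.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DedupById {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("dedup-by-id").getOrCreate();

        Dataset<Row> customers = spark.read().parquet("s3://my-bucket/customers/");    // placeholder input
        Dataset<Row> deduped = customers.dropDuplicates(new String[] {"customer_id"}); // shuffles by the ID column

        deduped.write().parquet("s3://my-bucket/customers_clean/");                    // placeholder output
        spark.stop();
    }
}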
So, while you ponder all these questions, if you're not sure, I'd recommend you start your cloud work with an elastic, pay-as-you-go solution rather than setting up entire clusters on the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is Amazon Athena (https://aws.amazon.com/athena/). You can dump your data in S3, where it's read by Athena, and you pay per query, so you don't pay when you're not using it. It is based on Apache Presto, so you could write the whole system using basically just SQL.
Otherwise you could use Elastic MapReduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html), or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. There are also similar products from Google (BigQuery, etc.) and Microsoft (Azure).
Yes, you can use Apache Apex for your use case. Apache Apex comes with Apache Malhar, which can help you build an application quickly: load the data using the JDBC input operator and then either store it in your cloud storage (maybe S3) or de-duplicate it before storing it in any sink. Malhar also provides a Dedup operator for such operations. But as mentioned in the previous reply, Apex does need Hadoop underneath to function.
Apache Beam supports multiple runner backends, including Apache Spark and Flink. I'm familiar with Spark/Flink and I'm trying to see the pros/cons of Beam for batch processing.
Looking at the Beam word count example, it feels it is very similar to the native Spark/Flink equivalents, maybe with a slightly more verbose syntax.
I currently don't see a big benefit of choosing Beam over Spark/Flink for such a task. The only observations I can make so far:
Pro: Abstraction over different execution backends.
Con: This abstraction comes at the price of having less control over what exactly is executed in Spark/Flink.
Are there better examples that highlight other pros/cons of the Beam model? Is there any information on how the loss of control affects performance?
Note that I'm not asking for differences in the streaming aspects, which are partly covered in this question and summarized in this article (outdated due to Spark 1.X).
There are a few things that Beam adds over many of the existing engines.
Unifying batch and streaming. Many systems can handle both batch and streaming, but they often do so via separate APIs. But in Beam, batch and streaming are just two points on a spectrum of latency, completeness, and cost. There's no learning/rewriting cliff from batch to streaming. So if you write a batch pipeline today but tomorrow your latency needs change, it's incredibly easy to adjust. You can see this kind of journey in the Mobile Gaming examples.
APIs that raise the level of abstraction: Beam's APIs focus on capturing properties of your data and your logic, instead of letting details of the underlying runtime leak through. This is both key for portability (see next paragraph) and can also give runtimes a lot of flexibility in how they execute. Something like ParDo fusion (aka function composition) is a pretty basic optimization that the vast majority of runners already do. Other optimizations are still being implemented for some runners. For example, Beam's Source APIs are specifically built to avoid overspecifying the sharding within a pipeline. Instead, they give runners the right hooks to dynamically rebalance work across available machines. This can make a huge difference in performance by essentially eliminating straggler shards. In general, the more smarts we can build into the runners, the better off we'll be. Even the most careful hand tuning will fail as data, code, and environments shift.
Portability across runtimes: Because data shapes and runtime requirements are neatly separated, the same pipeline can be run in multiple ways. And that means that you don't end up rewriting code when you have to move from on-prem to the cloud or from a tried and true system to something on the cutting edge. You can very easily compare options to find the mix of environment and performance that works best for your current needs. And that might be a mix of things -- processing sensitive data on premise with an open source runner and processing other data on a managed service in the cloud.
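As a concrete illustration of that portability, here is a minimal word-count-style pipeline sketch with the Beam Java SDK (input and output paths are placeholders); the same code runs on the Flink, Spark, or Dataflow runner, selected via the --runner pipeline option plus the corresponding runner dependency on the classpath.

import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalWordCount {
    public static void main(String[] args) {
        // e.g. --runner=FlinkRunner or --runner=SparkRunner selects the execution backend
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.read().from("input.txt"))                                        // placeholder input path
         .apply(FlatMapElements.into(TypeDescriptors.strings())
                 .via((String line) -> Arrays.asList(line.split("\\W+"))))              // split lines into words
         .apply(Count.perElement())                                                     // word -> count
         .apply(MapElements.into(TypeDescriptors.strings())
                 .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("word-counts"));                                      // placeholder output prefix

        p.run().waitUntilFinish();
    }
}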
Designing the Beam model to be a useful abstraction over many, different engines is tricky. Beam is neither the intersection of the functionality of all the engines (too limited!) nor the union (too much of a kitchen sink!). Instead, Beam tries to be at the forefront of where data processing is going, both pushing functionality into and pulling patterns out of the runtime engines.
Keyed State is a great example of functionality that existed in various engines and enabled interesting and common use cases, but wasn't originally expressible in Beam. We recently expanded the Beam model to include a version of this functionality according to Beam's design principles.
And vice versa, we hope that Beam will influence the roadmaps of various engines as well. For example, the semantics of Flink's DataStreams were influenced by the Beam (née Dataflow) model.
This also means that the capabilities will not always be exactly the same across different Beam runners at a given point in time. That's why we're using the capability matrix to try to clearly communicate the state of things.
I have a disadvantage to add, not a benefit. We had a leaky-abstraction problem with Beam: when an issue needed to be debugged, we had to learn about the underlying runner and its API, Flink in this case, to understand the issue. This doubles the learning curve, since you have to learn about Beam and Flink at the same time. We ended up switching the later-developed pipelines to Flink.
Helpful information can be found here - https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html
---Quoted---
Beam provides a unified API for both batch and streaming scenarios.
Beam comes with native support for different programming languages, like Python or Go with all their libraries like Numpy, Pandas, Tensorflow, or TFX.
You get the power of Apache Flink like its exactly-once semantics, strong memory management and robustness.
Beam programs run on your existing Flink infrastructure or infrastructure for other supported Runners, like Spark or Google Cloud Dataflow.
You get additional features like side inputs and cross-language pipelines that are not supported natively in Flink but only supported when using Beam with Flink
I am planning the next generation of an analysis system I'm developing, and I am thinking of implementing it using one of the MapReduce/stream-processing platforms like Flink, Spark Streaming, etc.
For the analysis, the mappers must have DB access.
So my greatest concern is that when a mapper is parallelized, the connections from the connection pool will all be in use and there might be a mapper that fails to access the DB.
How should I handle that?
Is it something I need to concern about?
As you have pointed out, a pull-style strategy is going to be inefficient and/or complex.
Your strategy for ingesting the meta-data from the DB will be dictated by the amount of meta-data and the frequency that the meta-data changes. Either way, moving away from fetching the meta-data when it's needed, and toward receiving updates when the meta-data is changed, is likely to be a good approach.
Some ideas:
Periodically dump the meta-data to flat file(s) in a distributed file system
Stream meta-data updates to your pipeline at write time to keep an in-memory cache up to date (a minimal sketch of this idea follows at the end of this answer)
Use a separate mechanism to fetch the meta-data, for instance Akka actor(s) polling for changes
It will depend on the trade-offs you are able to make for your given use-case.
If DB interactivity is unavoidable, I do wonder whether map-reduce-style frameworks would be the best approach to solve your problem. But any failed tasks should be retried by the framework.
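To make the in-memory cache idea a bit more concrete, here is a hedged Flink sketch in which each parallel task loads the meta-data in open() and refreshes it on a schedule, so the DB sees one connection per task rather than one per record; the JDBC URL, the query, and the string-typed events are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class EnrichWithMetadata extends RichMapFunction<String, String> {

    private transient Map<String, String> metadataCache;
    private transient ScheduledExecutorService refresher;

    @Override
    public void open(Configuration parameters) {
        metadataCache = new ConcurrentHashMap<>();
        refresher = Executors.newSingleThreadScheduledExecutor();
        // One connection per parallel task instead of one per record, refreshed every 5 minutes.
        refresher.scheduleAtFixedRate(this::reloadMetadata, 0, 5, TimeUnit.MINUTES);
    }

    private void reloadMetadata() {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://db/analytics"); // placeholder URL
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT key, value FROM metadata")) {           // placeholder query
            while (rs.next()) {
                metadataCache.put(rs.getString("key"), rs.getString("value"));
            }
        } catch (Exception e) {
            // keep serving the last good cache if the refresh fails
        }
    }

    @Override
    public String map(String event) {
        // Enrich the event from the cached meta-data instead of hitting the DB per record.
        return event + "|" + metadataCache.getOrDefault(event, "unknown");
    }

    @Override
    public void close() {
        if (refresher != null) {
            refresher.shutdownNow();
        }
    }
}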
Hey, I am new to big data. I am making a system which will fetch data from social media and process the results; for this I am using Apache Spark.
Following is the flow of my model:
the user will save the desired keywords using a webpage made in PHP,
with those keywords I would fetch data from social media, process the data (e.g., sentiments and views), and then provide it to the end user.
Now my confusion is how I should fetch the data from social media, using:
Apache Kafka
Apache Flume
or by directly calling the API, e.g. twitter4j (just an example).
I would have to learn how to implement all three data-fetching techniques, and if I happen to use the direct API then I can skip the whole Hadoop part. It would be great if you could suggest which one is better.
All of the above I am doing on a local machine. I have completed the UI part; now I am in the phase where I have to fetch the data.
Thanks.
I guess I will make this a suggestion.
You may not want to fetch data from any source using a distributed system, unless you plan to DDoS their production server. If your cluster is set up behind one router, your whole cluster may be blacklisted because all nodes consistently hit the access rate limit, which adds up at your router, depending on how powerful the server is. Twitter's servers honestly don't care about 100 threads (provided you know what you are doing), but any startup will probably come after you right away.
If you have a workstation with 4 cores, keeping it up and catching streaming data should suffice for the initial stage of academic research. If you really want tons of data, you can perhaps do Hadoop Streaming with your fetcher script as the mapper and no reducer; quick and easy. If you are a superstar in Java or Scala, run a fetching thread on each vcore of Spark's executors.
Now, Twitter has a REST API, which means you can pretty much fetch data in any programming language. Of course, sometimes it may be easier to use existing interfaces; assuming they are well maintained, they are almost always more robust. But I get lazy all the time. For example, I sometimes just want a sample data point, so I just pipe curl into jq to check what I want to check.
Yes, learn about jq too; it will save you tons of time. And be a gentleman who doesn't DDoS people.
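For the direct-API option from the question, here is a minimal twitter4j streaming sketch; it assumes OAuth credentials are configured in twitter4j.properties, and the tracked keywords are placeholders for whatever the user saves through the webpage.

import twitter4j.FilterQuery;
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class KeywordFetcher {
    public static void main(String[] args) {
        TwitterStream stream = new TwitterStreamFactory().getInstance();

        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // Hand the tweet off to whatever does the sentiment processing.
                System.out.println(status.getUser().getScreenName() + ": " + status.getText());
            }
        });

        // Track the user-supplied keywords; a single client stays within normal rate limits.
        stream.filter(new FilterQuery().track("keyword1", "keyword2"));
    }
}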