Need architecture hint: Data replication into the cloud + data cleansing

Need architecture hint: Data replication into the cloud + data cleansing - apache-spark

I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open source framework or cloud solution I can use for. I took a look into Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint which frameworks you would use for such an task?

From my quick read on APEX it requires Hadoop underneath coupling to more dependencies than you probably want early on.
Kafka on the other hand is used for transmitting messages (it has other APIs such as streams and connect which im not as familiar with).
Im currently using Kafka to stream log files in real time from a client system. Out of the box Kafka really only provides fire and forget semantics. I have had to add a bit to make it an exactly once delivery semantic (Kafka 0.11.0 should solve this).
Overall, think of KAFKA being a more low level solution with logical message domains with queues and from what I skimmed over APEX being a more heavy packaged library with alot more things to explore.
Kafka would allow you to switch out the underlying analytical system of your choosing with their consumer api.

The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which on the cloud it can quickly build up. Of course, the size of data is also important.
These are a few things you should consider:
batch vs streaming: do the updates flow continuously, or the process is run on demand/periodically (sounds the latter rather than the former)
what's the latency required ? That is, what's the maximum time that it would take an update to propagate through the system ? Answer to this question influences question 1)
how much data are we talking about ? If you're up the Gbyte size, Tbyte or Pbyte ? Different tools have different 'maximum altitude'
and what format ? Do you have text files, or are you pulling from relational DBs ?
Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning on using to do that part ? Depending on question 3), data size, deduping usually requires a join by ID, which is done in constant time in a key value store, but requires a sort (generally O(nlogn)) in most other data systems (spark, hadoop, etc)
So, while you ponder all this questions, if you're not sure, I'd recommend you start your cloud work with an elastic solution, that is, pay as you go vs setting up entire clusters on the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is amazon athena (https://aws.amazon.com/athena/). You can dump your data in S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Apache Presto, so you could write the whole system using basically SQL.
Otherwise you could use Elastic Mapreduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html). Or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. Also, there are similar products from Google (BigData, etc) and Microsoft (Azure).

Yes, you can use Apache Apex for your use case. Apache Apex is supported with Apache Malhar which can help you build application quickly to load data using JDBC input operator and then either store it to your cloud storage ( may be S3 ) or you can do de-duplication before storing it to any sink. It also supports Dedup operator for such kind of operations. But as mentioned in previous reply, Apex do need Hadoop underneath to function.

Related

Apache Pulsar - use cases for infinite retention of a topic

I am actually planing our next version of our telemetry system architecture. I am strongly considering Pulsar at the messaging solution.
To better understand what's this technology is best for, can someone share their use cases of why their use the infinite retention of a topic other than audit trail ?
I was main goal is to see if our telemetry data could be simply stored in a pulsar topic and query that for analytics purpose instead of using a time series database like Apache Druid.
Thanks !

The use-case I've had for infinite retention is when you want to store the history going back to the beginning: e.g. in an event-sourcing style approach, the longer you're keeping the events archived, the more able you are to remix your state.
With durable-log style storage, remember that it heavily optimizes for slurping the log starting at some point. For higher-volume queries or queries with strict latency requirements, this is generally pretty unsuited for that sort of workload, and even more so if you can't limit reads to a single partition (remember also that with multiple partitions, even the ordering of the messages in the log may be difficult to reconstruct). For infrequent queries with loose latency requirements, though, storing them in pulsar might not be that bad, especially if you'd be using pulsar already to feed data into the time-series store (as you can then dispense with the time-series store).

web real time analytics dashboard: which technologies should use? (node/django, cassandra/mongodb...)

we want to develop a dashboard to analyze geospatial data.
This is a small and close approach to what we want to do: http://adilmoujahid.com/images/data-viz-talkingdata.gif
Our main concerns are about the backend technologies to be used. (front will be D3.js, DC.js, leaflet.js...)
Between Django and node.js, we think that we will use node.js, cause we've read than its faster than Django for this kind of tasks. But we are not sure and we are open to ideas.
But about Mongo or Cassandra, we are so confused. Our data is mostly structured, so store it in tables like Cassandra would make it easy to manage, also Cassandra seems to have better performance. However, we also have IoT devices data, with lots of real-time GPS location...
Which suggestions can you give to us to achieve our goal?
TL;DR Summary;
Dashboard with hundreds of simultaneous users.
Stored data will be mostly structured text/numbers, but will include also images, GPS-arrays, IoT sensors, geographical data (vector-polygons & rasters)
Databases will receive high write load coming from sensors.
Dashboard performance is so important. Its more important to read data in real time, than keeping it uncorrupted/secure.
Most calculus/math will be calculated in the client's browser, the server will try to avoid mathematical operations.

Disclaimer: I'm a DataStax employee so I'll comment on the Cassandra piece.
Cassandra is a good choice for this if your dashboard can be planned around a set of known queries. If those users will be doing ad-hoc queries directly to the database from the dashboard, you'll want something with a little more flexibility like ElasticSearch or (shameless plug) DataStax Search. Especially if you expect the queries/database to handle some of the geospatial logic.

JaguarDB has very strong support of geospatial data (2D and 3D). It allows you to store multi-measurements per point location while other databases support only one measurement (pointm). Many complex queries such as Voronoi polygon, convexhull are also supported. It is open source, distributed and sharded, multiple columns indexes, etc.

Concerning Postgresql and Cassandra, is there much difference in RAM/CPU/DISK usage between them?
Our use case does not require transactions, it will be in a single node and we will have IoT devices writing data up to 500 times per second. However ive read that Geographical data that works better with Potstgis than cassandra...
According to this use case, do you recommend Cassandra or Postgis?

Spark: Is there a way to use resources in each of the local machine where is distributed?

I have to admit that I don't know how to formulate properly a title question for this (any help is appreciated), but I'll try to be more clear here:
I would like to distribute a task with Spark, but I would like to use exclusively some resources. There are no restrictions on the order of the dataset processed, but I wish that every batch distributed and analysed in different nodes of the clusters use different resources.
I will give an example that, hopefully, will make the question clearer:
Immagine I have to analyse 10MLN text messages for a task of sentiment analysis. The sentiment analysis is provided by a Webserver that is able to analyse a batch of 100 messages in 100ms via API accessible with credentials. Since I don't want to waste weeks for analyse them all, the idea is to distribute the task. But I cannot distribute the SAME credential because I would incur in RateLimit, or overload.
The desirable solution would be to use ONE credential per partition in Spark, or per node. How can I do that given that the credentials might change, so they are not fixed for nodes?

Which tools to use when migrating bounded data?

I recently started working on a content repository migration project between two different content management systems.
We have around 11 petabytes of documents in a source repository. We want to migrate all of them one document at a time by querying with source system API and saving through destination system API.
We will have a single standalone machine for this migration and should be able to manage (start, stop, resume) the whole process.
What platforms and tools would you suggest for such task? Is Flink's Dataset API for bounded data suitable for this job?

Flink's DataStream API is probably a better choice than the DataSet API because the streaming API can be stopped/resumed and can recover from failures. By contrast, the DataSet API reruns failed jobs from the beginning, which isn't a good fit for a job that might run for days (or weeks).
While Flink's streaming API is designed for unbounded data streams, it also works very well for bounded datasets.
If the underlying CMSes can support doing the migration in parallel, Flink would easily accommodate this. The Async I/O feature would be helpful in that context. But if you are going to do the migration serially, then I'm not sure you'll get much benefit from a framework like Flink or Spark.

Basically what David said above. The main challenge I think you'll run into is tracking progress such that checkpointing/savepointing (and thus restarting) works properly.
This assumes you have some reasonably efficient and stable way to enumerate the unique IDs for all 1B documents in the source system. One approach we've used in a previous migration project (though not with Flink) was to use the document creation timestamp as the "event time".

Streaming Big Data - where to store intermediate results?

I am working on spark streaming job that requires to store intermediate results in order to reuse them in next window stream. Number of data is extremely large so probably there is no way to store it in spark cache. What is more I need in someway to read data by some 'key'.
I was thinking about Cassandra as intermediate storage but it also has some drawbacks.
Alternatively, maybe Kafka will be do the job but it will require additional work in order to select given portion of data by key.
Could you advise me what I should do?
How such problems are resolved in Storm - is there any internal mechanism or it is preferred to use some external tools?

Solr as Index + Cassandra as NoSQL storage working fine for my use case where I have to process tera bytes of data. But in my case, I am using Cassandra for persistent storage of years of data.
Kafka is working fine as a replacement Jboss/AMQ due to it's simple architecture. Currently I am working Apache Storm + Kafka for real time stream processing in one of the projects.
Since you are storing intermediate data, I think Kafka is best choice by setting right retention period.
Have a look at one more SE Question and other article

As you mention, Kafka has some problems getting items by key. It really only provides APIs for FIFO paradigm. I would advise to use a dedicated storage software, Cassandra, MongoDB, I even seen Solr used to store text. It would be easier to use something designed for key retrieval rather than try to modify Kafka yourself and most likely introduce bugs/issues that could take forever to solve.
As SQL.injection said, you'll have to manage the storage and logic by yourself. Storm doesn't offer such a mechanism.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string