DB access from a Mapper in MapReduce - apache-spark

I planning the next generation of an analysis system I'm developing and I think of implementing it using one of the MapReduce/Stream-Processing platforms like Flink, Spark Streaming etc.
For the analysis, the mappers must have DB access.
So my greatest concern is when a mapper is paralleled, the connections from the connection pool will all be in use and there might be a mapper that fail to access the DB.
How should I handle that?
Is it something I need to concern about?

As you have pointed out, a pull-style strategy is going to be inefficient and/or complex.
Your strategy for ingesting the meta-data from the DB will be dictated by the amount of meta-data and the frequency that the meta-data changes. Either way, moving away from fetching the meta-data when it's needed, and toward receiving updates when the meta-data is changed, is likely to be a good approach.
Some ideas:
Periodically dump the meta-data to flat file/s into distributed file system
Streaming meta-data updates to your pipeline at write-time to keep an in-memory cache up-to-date
Use a separate mechanism to fetch the meta-data, for instance Akka Actor/s polling for changes
It will depend on the trade-offs you are able to make for your given use-case.
If DB interactivity is unavoidable, I do wonder if map-reduce style frameworks would be the best approach to solve your problem. But any failed tasks should be retried by the framework.

Related

Apache Pulsar - use cases for infinite retention of a topic

I am actually planing our next version of our telemetry system architecture. I am strongly considering Pulsar at the messaging solution.
To better understand what's this technology is best for, can someone share their use cases of why their use the infinite retention of a topic other than audit trail ?
I was main goal is to see if our telemetry data could be simply stored in a pulsar topic and query that for analytics purpose instead of using a time series database like Apache Druid.
Thanks !
The use-case I've had for infinite retention is when you want to store the history going back to the beginning: e.g. in an event-sourcing style approach, the longer you're keeping the events archived, the more able you are to remix your state.
With durable-log style storage, remember that it heavily optimizes for slurping the log starting at some point. For higher-volume queries or queries with strict latency requirements, this is generally pretty unsuited for that sort of workload, and even more so if you can't limit reads to a single partition (remember also that with multiple partitions, even the ordering of the messages in the log may be difficult to reconstruct). For infrequent queries with loose latency requirements, though, storing them in pulsar might not be that bad, especially if you'd be using pulsar already to feed data into the time-series store (as you can then dispense with the time-series store).

Do we need a cache layer for cassandra?

I am playing/evaluating a realtime streaming application with millions of user activity logs, the design was to use Cassandra as a persistent store and use redis as a cache layer to store recent activities (last 1000). I am looking for a suggestion whether such a cache layer necessary along with cassandra. Is cassandra capable to get best read and write performance? The activities are streamed to front end as pages of 10 or 15 records. Suggestions are expected to use any alternative noSQL solutions as well
It depends a lot on your requirements - Cassandra is reasonably fast for most common purposes, but redis will be faster, so having a caching layer is a reasonable and common approach. It's not strictly necessary, but it's not a bad idea.

Which tools to use when migrating bounded data?

I recently started working on a content repository migration project between two different content management systems.
We have around 11 petabytes of documents in a source repository. We want to migrate all of them one document at a time by querying with source system API and saving through destination system API.
We will have a single standalone machine for this migration and should be able to manage (start, stop, resume) the whole process.
What platforms and tools would you suggest for such task? Is Flink's Dataset API for bounded data suitable for this job?
Flink's DataStream API is probably a better choice than the DataSet API because the streaming API can be stopped/resumed and can recover from failures. By contrast, the DataSet API reruns failed jobs from the beginning, which isn't a good fit for a job that might run for days (or weeks).
While Flink's streaming API is designed for unbounded data streams, it also works very well for bounded datasets.
If the underlying CMSes can support doing the migration in parallel, Flink would easily accommodate this. The Async I/O feature would be helpful in that context. But if you are going to do the migration serially, then I'm not sure you'll get much benefit from a framework like Flink or Spark.
Basically what David said above. The main challenge I think you'll run into is tracking progress such that checkpointing/savepointing (and thus restarting) works properly.
This assumes you have some reasonably efficient and stable way to enumerate the unique IDs for all 1B documents in the source system. One approach we've used in a previous migration project (though not with Flink) was to use the document creation timestamp as the "event time".

Hazelcast Jet - Use Cases

What are the use-cases of Hazelcast Jet? Has anyone started using it?
Our project uses Hazelcast for Distributed Map holding Key-Value pair and Distributed computing on those Keys to run the task at the node holding the Key. We use NearCache solution as well.
I was curious to know how different is Hazelcast Jet and what problems does it solve?
As of current version (0.3), Jet's advantage over just submitting a Runnable to each partition is the ability to perform grouping by a key other than the one used in the Hazelcast map. For this to work in a distributed environment you have to send each item to the processing unit responsible for its grouping key, and this is something that is easy to get from Jet.
Further from that, you can build a multistage cascade of groupBy operations, you can have forks in your data stream to reuse the same intermediate result in more than one way, you can build a pipeline where an I/O task distributes the processing of the data it reads across all CPU cores, etc... in short, all the advantages that a full-blown DAG computation engine offers.
By the time it reaches 1.0 Jet will also support fault-tolerant infinite stream processing, event time-based windows, and more.
2021 answer for use cases:
Change data capture streaming - Use Debezium/Hazelcast to detect changes to your database and stream to other microservices (if data is common), stream changes to a data lake, or update a search engine
Real time analytics - Take market data stream and perform technical analysis in realtime or twitter analysis
Async job processing - PDF conversion service

When does it make sense to use a database for a job queue?

I'm designing an internal company web site where users can submit jobs for computation. An important factor in my design is to persist the job in the queue until it's completed, even if there is a system failure.
It seems the internet is against the idea as it's "not really the purpose of a database" and better suited for a key/value store like Redis (or a job queue that makes use of Redis, like Kue for Node.js). I think I get it in the sense that the purpose of this design is to not overburden the database with read/writes for fairly transient data as you would find in a job queue. In my use case though database use would be pretty low and it seems the persistence of data that a database offers is the key feature I'm looking for here.
In my reading I've found that some key/value stores, like Redis, have a persist function but it's not really built to make sure all data is recoverable in case the system goes down.
Am I missing something here or does this sound about right?
In my reading I've found that some key/value stores, like Redis, have
a persist function but it's not really built to make sure all data is
recoverable in case the system goes down.
In Redis, data is persisted from memory to disk using a background thread. Actually a system down won't corrupt your database, but the issue is that a system can go down before or during a snapshot to disk and you'll lose any data that was created after the last successful snapshot.
If you can be sure that your Redis server will be up and running 99.9% of the time, this isn't a big problem, but anyway, the problem exists.
At the end of the day, my best advise is you should use the right tool for the job: a generalist database, either NoSQL or SQL, isn't made for job enqueueing. Use an existing tool that it does this already like RabbitMQ.

Resources