Adding [hazelcast-jet] to existing Application

I have an existing application that uses Hazelcast for tracking cluster membership and for distributed task execution. I'm thinking that Jet could be useful for adding analytics on top of the existing application, and I'm trying to figure out how best to layer Jet on top of what we already have.
So my first question is, how should I run Jet on top of our existing Hazelcast configuration? Do I have to run Jet separately, or replace our existing Hazelcast configuration with Jet (since Jet does expose the HazelcastInstance)?
My second question is, I see lots of examples using IMap and IList, but I'm not seeing anything that uses topics as a source (I also don't see this as an option from the Sources builder). My initial thought on using Jet was to emit events (io perf data, http request data) from our existing code to a topic and then have Jet process that topic, generate analytics from that data, and then push that to an IMap. Is this the wrong approach? Should I be using some other structure to push these events into Jet? I saw that I can make my own custom Source where I could do this, but I felt that I must be going down the wrong path if I was pursuing this given there wasn't one already provided by the library for this specific purpose.

You can either upgrade your current Hazelcast IMDG cluster to a Jet cluster and run your legacy application alongside Jet jobs, or start a separate cluster for Jet. The first setup is simpler to deploy and operate. Starting an extra cluster for Jet is also perfectly fine; its advantage is isolation (cluster lifecycle, failures, etc.). Just be aware that you can't combine IMDG 3.x with Jet 4.x clusters.
Use an IMap with the event journal to connect two jobs or to ingest data into the cluster. It's the simplest fault-tolerant option that works out of the box. Jet's data source must be replayable: if the job fails, it goes back to the last state snapshot and rewinds the data source offset accordingly.
A topic can be used (via the Source Builder) but it won't be fault-tolerant (some messages might get lost). Jet achieves fault tolerance by snapshotting the job regularly. In the case of a failure, the latest snapshot is restored and the data following the snapshot is replayed. Unlike the journal, a topic consumer can't replay the data using an offset.
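To make the difference between a replayable journal and a topic concrete, here is a toy model in plain Python. It is not the Jet API: Journal, Topic and run_job are hypothetical names invented for this sketch, and the "snapshot" is just a dict. It only illustrates why an offset-addressable source can recover after a crash while a fire-and-forget topic cannot.

```python
class Journal:
    """Append-only log. Events keep their offsets, so a reader can rewind."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset):
        return self._events[offset:]


class Topic:
    """Fire-and-forget pub/sub. Only listeners attached right now get the event."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def unsubscribe(self, listener):
        self._listeners.remove(listener)

    def publish(self, event):
        for listener in list(self._listeners):
            listener(event)


def run_job(journal, snapshot, snapshot_every=2, fail_at=None):
    """Sum the journal, saving progress into `snapshot` every few events."""
    offset, total = snapshot["offset"], snapshot["total"]
    for event in journal.read_from(offset):
        if offset == fail_at:
            raise RuntimeError("simulated crash at offset %d" % offset)
        total += event
        offset += 1
        if offset % snapshot_every == 0:
            snapshot.update(offset=offset, total=total)
    snapshot.update(offset=offset, total=total)
    return total


# Journal: crash at offset 3, then restart from the last snapshot (offset 2).
journal = Journal()
for value in [1, 2, 3, 4, 5]:
    journal.append(value)
snapshot = {"offset": 0, "total": 0}
try:
    run_job(journal, snapshot, fail_at=3)
except RuntimeError:
    pass
offset_after_crash = snapshot["offset"]
journal_total = run_job(journal, snapshot)  # replays offsets 2, 3 and 4

# Topic: the consumer is detached while it "restarts" and misses an event.
topic = Topic()
seen = []
listener = seen.append
topic.subscribe(listener)
topic.publish(1)
topic.unsubscribe(listener)
topic.publish(2)  # nobody is listening, so this event is gone
topic.subscribe(listener)
topic.publish(3)
```

After the restart the journal job still arrives at the full sum, because the work done after the last snapshot is simply read again. The topic consumer ends up with a hole where the second event should be.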


Kubernetes vs Spark vs Spark on Kubernetes

So I have a use case where I will stream about 1000 records per minute from Kafka. I just need to dump these records in raw form into a NoSQL DB or something like a data lake.
I tried two approaches.
Approach 1
Create Kafka consumers in Java and run them as three different containers in Kubernetes. Since all the containers are in the same Kafka consumer group, they all contribute to reading from the same Kafka topic and dump the data into the data lake. This works pretty quickly for the volume of workload I have.
Approach 2
I then created a Spark cluster and used the same Java logic to read from Kafka and dump the data into the data lake.
The performance of the Kubernetes approach was, if not better, at least equal to that of the Spark job running in cluster mode.
So my question is, what is the real use case for using Spark over Kubernetes the way I am using it, or even Spark on Kubernetes?
Is Spark only going to rise and shine with much heavier workloads, say something on the order of 50,000 records per minute, or in cases where some real-time processing needs to be done on the data before dumping it to the sink?
Spark has more cost associated with it, so I need to make sure I use it only if it would scale better than the Kubernetes solution.
If your use case is only to archive/snapshot/dump records, I would recommend you look into Kafka Connect.
If you need to process the records you stream, e.g. aggregate or join streams, then Spark comes into the game. For this case you may also look into Kafka Streams.
Each of these frameworks has its own tradeoffs and performance overheads, but in any case you save a lot of development effort by using tools made for the job rather than developing your own consumers. These frameworks already support most of the failure handling, scaling, and configurable semantics, and they have enough config options to tune the behaviour for most cases you can imagine. Just choose the available integration and you're good to go! And of course beware the open source bugs ;) .
Running Spark inside Kubernetes is only recommended when you have a lot of expertise doing it. As Kubernetes doesn't know it's hosting Spark, and Spark doesn't know it's running inside Kubernetes, you will need to double-check every feature you decide to use.
For your workload, I'd recommend sticking with Kubernetes. The elasticity, performance, monitoring tools and scheduling features, plus the huge community support, add up well in the long run.
Spark is an open source, scalable, massively parallel, in-memory execution engine for analytics applications, so it will really spark when your load becomes more processing-intensive. It simply doesn't have much room to rise and shine if you are only dumping data, so keep it simple.
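For a sense of scale, the rates from the question can be converted to records per second per container. This is only back-of-the-envelope arithmetic: per_second is a hypothetical helper, and it assumes the load is spread evenly over the three consumers from Approach 1.

```python
def per_second(records_per_minute, consumers=1):
    """Average records per second that each consumer has to handle."""
    return records_per_minute / 60 / consumers


current = per_second(1000, consumers=3)   # about 5.6 records/s per container
heavier = per_second(50000, consumers=3)  # about 278 records/s per container
```

Even the heavier figure is only a few hundred records per second per container, which fits the advice above that a pure dump workload gives Spark little room to pay off.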

How do I monitor progress and recover in a long-running Spark map job?

We're using Spark to run an ETL process by which data gets loaded in from a massive (500+GB) MySQL database and converted into aggregated JSON files, then gets written out to Amazon S3.
My question is two-fold:
This job could take a long time to run, and it would be nice to know how that mapping is going. I know Spark has a built in log manager. Is it as simple as just putting a log statement inside of each map? I'd like to know when each record gets mapped.
Suppose this massive job fails in the middle (maybe it chokes on a DB record or the MySQL connection drops). Is there an easy way to recover from this in Spark? I've heard that caching/checkpointing can potentially solve this, but I'm not sure how.
This seems like two questions with lots of possible answers and detail. Anyway, assuming a non-Spark-Streaming answer and drawing on my own reading and research, here is a limited response.
The following options exist for logging and for checking the progress of stages, tasks and jobs:
Global logging via log4j, tailored by using the template file stored under the SPARK_HOME/conf folder, which serves as a basis for defining your own logging requirements at the Spark level.
Programmatically by using Logger, via import org.apache.log4j.{Level, Logger}.
The REST API to get the status of Spark jobs. See this enlightening blog:
There is also a Spark Listener that can be used.
Browse to http://<host>:8080 to see progress via the Web UI.
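As a rough illustration of the REST API option, the sketch below polls the job list of a running application and sums up task counts. The endpoint path and the field names (jobId, status, numTasks, numCompletedTasks) follow Spark's monitoring REST API as I understand it; the base URL, the application id and the helper names fetch_jobs and summarize are assumptions for this example. It reports progress per task, not per record.

```python
import json
from urllib.request import urlopen


def fetch_jobs(base_url, app_id):
    """Fetch the job list of one application from the monitoring REST API.

    base_url is an assumption, e.g. "http://driver-host:4040" for a running app.
    """
    with urlopen("%s/api/v1/applications/%s/jobs" % (base_url, app_id)) as resp:
        return json.load(resp)


def summarize(jobs):
    """Return (completed_tasks, total_tasks, failed_job_ids) for a job list."""
    done = sum(job["numCompletedTasks"] for job in jobs)
    total = sum(job["numTasks"] for job in jobs)
    failed = [job["jobId"] for job in jobs if job["status"] == "FAILED"]
    return done, total, failed


# Hand-written sample in the shape of the API response, so this runs offline.
sample = [
    {"jobId": 0, "status": "SUCCEEDED", "numTasks": 100, "numCompletedTasks": 100},
    {"jobId": 1, "status": "FAILED", "numTasks": 100, "numCompletedTasks": 40},
    {"jobId": 2, "status": "RUNNING", "numTasks": 100, "numCompletedTasks": 10},
]
progress = summarize(sample)
```

Against a live application you would call summarize(fetch_jobs(base_url, app_id)) on a timer and log or alert on the result.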
Recovery depends on the type of failure: graceful vs. non-graceful, fault-tolerance aspects, memory usage issues, and things like serious database duplicate key errors, depending on the API used.
See How does Apache Spark handles system failure when deployed in YARN? Spark handles its own failures by looking at the DAG and attempting to reconstruct a partition by re-executing what is needed. This all falls under fault tolerance, for which nothing needs to be done.
Things outside of Spark's domain and control mean it's over, e.g. memory issues that may result from exceeding various parameters in large-scale computations, a DataFrame JDBC write against a store with a duplicate key error, or JDBC connection outages. This means re-execution.
As an aside, some aspects are not logged as failures even though they are, e.g. duplicate key inserts on some Hadoop storage managers.
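If re-execution is unavoidable, one way to keep it cheap is to split the job into chunks and remember which chunks finished. The sketch below is an application-level idea, not a Spark feature: run_chunks and the done set are hypothetical, the set stands in for something durable (for example a marker per output file), and it assumes the source can be partitioned, say by primary-key ranges.

```python
def run_chunks(chunks, process, done):
    """Process each (chunk_id, rows) pair once; `done` holds finished chunk ids."""
    for chunk_id, rows in chunks:
        if chunk_id in done:
            continue  # finished in an earlier attempt, skip on re-execution
        process(chunk_id, rows)  # may raise, e.g. bad record or lost connection
        done.add(chunk_id)  # mark success only after the chunk is fully written


chunks = [(0, ["a", "b"]), (1, ["c"]), (2, ["bad"]), (3, ["d"])]
done = set()
processed = []  # chunk ids that were actually processed, in order


def flaky(chunk_id, rows):
    if "bad" in rows:
        raise ValueError("cannot map record in chunk %d" % chunk_id)
    processed.append(chunk_id)


def fixed(chunk_id, rows):
    processed.append(chunk_id)


try:
    run_chunks(chunks, flaky, done)  # first attempt dies on chunk 2
except ValueError:
    pass
done_after_failure = set(done)

run_chunks(chunks, fixed, done)  # second attempt resumes at chunk 2
```

Chunks 0 and 1 are not processed again on the second attempt, so a failure late in the run costs one chunk rather than the whole job.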

Avoiding re-processing of data during Spark Structured Streaming application updates

I am using Structured Streaming with Spark 2.2. We are using Kafka as our source and are using checkpoints for failure recovery and e2e exactly once guarantees. I would like to get some more information on how to handle updates to the application when there is a change in stateful operations and/or output schema.
As some of the sources suggest, I can start the updated application in parallel with the old application until it catches up with the old application in terms of data, and then kill the old one. But then the new application will have to re-read/re-process all the data in Kafka, which could take a long time.
I want to avoid this re-processing of the data in the newly deployed updated application.
One way I can think of is for the application to keep writing the offsets into something in addition to the checkpoint directory, for example ZooKeeper or HDFS. Then, on an update of the application, I tell the Kafka readStream to start reading from the offsets stored in this new location (ZooKeeper/HDFS), since the updated application can't read from the checkpoint directory, which is now deemed incompatible.
So a couple of questions:
Is the above-stated solution a valid solution?
If yes, how can I automate the detection of whether the application is being restarted because of a failure/maintenance or because of code changes to stateful operations and/or output schema?
Any guidance, example or information source is appreciated.
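For what it's worth, the Kafka source accepts a JSON string in its startingOffsets option, so offsets saved outside the checkpoint can be turned into that string. The helper below is a minimal sketch: starting_offsets and the shape of the saved dict are assumptions for this example, and how the offsets get written to ZooKeeper or HDFS is left out.

```python
import json


def starting_offsets(saved):
    """Build the JSON string for the Kafka source's startingOffsets option.

    `saved` maps (topic, partition) to the next offset to read.
    """
    by_topic = {}
    for (topic, partition), offset in sorted(saved.items()):
        by_topic.setdefault(topic, {})[str(partition)] = offset
    return json.dumps(by_topic)


# Hypothetical offsets recovered from the external store.
saved = {("events", 0): 42, ("events", 1): 17}
option_value = starting_offsets(saved)
```

The result would be passed as .option("startingOffsets", option_value) on the Kafka readStream. As far as I know, this option is only honoured when the query starts with a fresh checkpoint location, and each partition of the subscribed topics needs an entry. Note also that any aggregation state held in the old checkpoint is not carried over this way; only the read position is.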

Memsql Spark-Kafka Transform Failure

We have a Spark cluster running under MemSQL with several pipelines. The ETL setup is as follows:
Extract: Spark reads messages from the Kafka cluster (using MemSQL Kafka-Zookeeper)
Transform: We have a custom jar deployed for this step
Load: Data from the Transform stage is loaded into a columnstore
I have the following questions:
What happens to a message polled from Kafka if the job fails in the Transform stage?
- Does MemSQL take care of loading that message again?
- Or is the data lost?
If the data gets lost, how can I solve this problem? Are there any configuration changes that need to be made for this?
As it stands, at-least-once semantics are not available in MemSQL Ops. It is on the roadmap and will be present in one of the future releases of Ops.
If you haven't yet, you should check out MemSQL 5.5 Pipelines.
This one isn't based on Spark (and transforms are done a bit differently, so you might have to rewrite your code), but we have native Kafka streams now.
The way we get exactly-once with the native version is simple: store the offsets in the database in the same atomic transaction as the actual data. If something fails and the transaction isn't committed, the offsets won't be committed, so we'll naturally and automatically retry that partition-offset-range.

How to integrate Cassandra with Zookeeper to support transactions

I have a Cassandra cluster and a Zookeeper server installed. Now I want to support transactions in Cassandra using Zookeeper. How do I do that?
Zookeeper creates znodes to perform read and write operations, and data goes to and fro through znodes in Zookeeper. I want to know how to support rollback and commit features in Cassandra using Zookeeper. Is there any way to specify Cassandra configurations in Zookeeper or Zookeeper configurations in Cassandra?
I know how data is read and written in Cassandra and Zookeeper individually, but I don't know how to integrate both of them using Java.
How can we do transactions in Cassandra using Zookeeper?
"I have a Cassandra cluster and a Zookeeper server installed. Now I want to support transactions in Cassandra using Zookeeper. How do I do that?"
With great difficulty. Cassandra does not work well as a transactional system. Writes to multiple rows are not atomic, there is no way to roll back writes if some writes fail, and there is no way to ensure readers read a consistent view when reading.
"I want to know how to support rollback and commit features in Cassandra using Zookeeper."
Zookeeper won't help you with this, especially the commit feature. You may be able to write enough information to Zookeeper to roll back in case of failure, but if you are doing that, you might as well store the rollback info in Cassandra.
Zookeeper and Cassandra work well together when you use Zookeeper as a locking service. Look at the Cages library. Use Zookeeper to coordinate reads and writes to Cassandra.
Trying to use cassandra as a transactional system with atomic commits to multiple rows and rollbacks is going to be very frustrating.
There are ways to implement transactions in Cassandra without ZooKeeper.
Cassandra itself has a feature called lightweight transactions, which provides per-key linearizability and compare-and-set. With such primitives you can implement serializable transactions at the application level yourself.
Please see the Visualization of serializable cross shard client-side transactions post for details and a step-by-step visualization.
Variants of this approach are used in Google's Percolator system and in CockroachDB.
By the way, if you're fine with the Read Committed isolation level, then it makes sense to take a look at the RAMP transactions paper by Peter Bailis.
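To show what the compare-and-set primitive buys you, here is a single-threaded toy model in plain Python. It does not talk to Cassandra: Store, compare_and_set and add_to_balance are hypothetical names, and the sketch covers only the per-key primitive, not the cross-shard transaction protocol from the post mentioned above.

```python
class Store:
    """In-memory stand-in for a table that supports per-key compare-and-set."""

    def __init__(self):
        self._rows = {}

    def get(self, key):
        return self._rows.get(key)

    def compare_and_set(self, key, expected, new):
        """Write `new` only if the current value equals `expected`."""
        if self._rows.get(key) != expected:
            return False  # someone else changed the row, caller must re-read
        self._rows[key] = new
        return True


def add_to_balance(store, key, delta):
    """Optimistic read-modify-write: retry until the conditional write applies."""
    while True:
        current = store.get(key)
        if store.compare_and_set(key, current, (current or 0) + delta):
            return


store = Store()
add_to_balance(store, "alice", 10)
stale = store.get("alice")         # a second writer reads 10 ...
add_to_balance(store, "alice", 5)  # ... but another update lands first
stale_write_applied = store.compare_and_set("alice", stale, stale - 3)
```

The write based on the stale read is rejected instead of silently overwriting the newer value, which is the building block the application-level transaction schemes rely on.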
There is a BATCH feature for Cassandra's CQL3 (Cassandra 1.2 is the formal version that released CQL3), which allegedly can atomically apply all the updates in the BATCH as one unit all-or-nothing.
This does not mean you can roll back a successfully executed BATCH as an RDBMS could; that would have to be done manually.
Depending on the consistency level and options you provide to the BATCH statement, the atomicity guarantee can be traded for performance: the UNLOGGED option skips the batch log and with it the all-or-nothing guarantee across partitions.
Well, I'm not an expert at this (far from it, actually), but the way I see it, either you deploy some middleware of your own in order to guarantee the specific properties you are looking for, or you can just have Cassandra write data to auxiliary files and then copy them through the file system, since the copy function in Java works as an atomic operation.
I don't know anything about the size of the data files you are considering so I don't really know if it is doable, however there might be a way to use this property through smaller bits of information and then combining them as a whole.
Just my 2 cents...
