Streaming data from Kafka to Hazelcast and persisting it into Cassandra

Let me describe the architecture of my system before diving into the heart of the problem.
I have a stream of data coming from Kafka, and my company uses a distributed cache (Hazelcast, precisely) to make the data available to the web services we expose. We also want to persist the cached data to Cassandra so it is durable. I have two solutions for getting the data into Hazelcast, and I would like your suggestions (maybe another way of doing it): which solution is best in your view, and why?
1/ Use a Kafka-Hazelcast connector to send data directly from Kafka to Hazelcast, then persist it to Cassandra using write-behind and MapStores (a minimal sketch follows this list) ==> there are two main drawbacks with this solution: first, we have to serialize/deserialize each time we store data to Cassandra (significant CPU usage), and second, we put all the data into the cache even when users don't need it (we see lots of evictions happening)
2/ Use a Kafka-Cassandra connector to write data directly to Cassandra, then find a means (how complex do you think this part could be?) to notify Hazelcast to update/evict the data if it is already in the cache ==> the pros of this solution are that we get rid of the serialization/deserialization needed by the MapStores, and we only load data that was queried before and whose key is already in the cache
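For concreteness, here is a minimal sketch of what option 1 could look like, assuming the Hazelcast 4/5 MapStore API and the DataStax Java driver 4.x; the keyspace/table demo.events, the String key/value types, and the write-behind settings are illustrative assumptions, not a definitive implementation:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;
import com.hazelcast.config.Config;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.map.MapStore;

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class CassandraMapStore implements MapStore<String, String> {

    // Contact points default to localhost or come from the driver's application.conf.
    private final CqlSession session = CqlSession.builder().build();
    private final PreparedStatement insert =
            session.prepare("INSERT INTO demo.events (id, payload) VALUES (?, ?)");
    private final PreparedStatement select =
            session.prepare("SELECT payload FROM demo.events WHERE id = ?");
    private final PreparedStatement remove =
            session.prepare("DELETE FROM demo.events WHERE id = ?");

    @Override
    public void store(String key, String value) {
        // Called asynchronously by Hazelcast when write-behind is enabled.
        session.execute(insert.bind(key, value));
    }

    @Override
    public void storeAll(Map<String, String> entries) {
        entries.forEach(this::store);
    }

    @Override
    public void delete(String key) {
        session.execute(remove.bind(key));
    }

    @Override
    public void deleteAll(Collection<String> keys) {
        keys.forEach(this::delete);
    }

    @Override
    public String load(String key) {
        Row row = session.execute(select.bind(key)).one();
        return row == null ? null : row.getString("payload");
    }

    @Override
    public Map<String, String> loadAll(Collection<String> keys) {
        Map<String, String> result = new HashMap<>();
        for (String key : keys) {
            String value = load(key);
            if (value != null) {
                result.put(key, value);
            }
        }
        return result;
    }

    @Override
    public Iterable<String> loadAllKeys() {
        return null; // returning null disables eager pre-loading of the whole table
    }

    public static void main(String[] args) {
        Config config = new Config();
        config.getMapConfig("events")
              .setMapStoreConfig(new MapStoreConfig()
                      .setEnabled(true)
                      .setImplementation(new CassandraMapStore())
                      .setWriteDelaySeconds(5)   // > 0 switches from write-through to write-behind
                      .setWriteBatchSize(500));
        Hazelcast.newHazelcastInstance(config);
    }
}
```

Hazelcast switches from write-through to write-behind as soon as write-delay-seconds is greater than zero; the serialization/deserialization cost described above is paid in store()/load().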
Which of the two solutions do you prefer, and why?
What is, in your opinion, the best means to notify Hazelcast in the second solution? (One possible approach is sketched below.)
Thank you in advance for your suggestions/answers.
I hope I was concise and clear!
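Regarding the notification question in option 2, one possible (hedged) approach is a small additional consumer on the same topic that the Kafka-Cassandra connector reads, which evicts (or refreshes) only keys that are already cached. The topic name "events", the map name "events", the broker address and the String types below are placeholders:

```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class CacheInvalidator {
    public static void main(String[] args) {
        HazelcastInstance client = HazelcastClient.newHazelcastClient();
        IMap<String, String> cache = client.getMap("events");

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "hazelcast-invalidator");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // Touch the cache only for keys users have already queried.
                    if (cache.containsKey(record.key())) {
                        cache.evict(record.key()); // or cache.set(record.key(), record.value()) to refresh in place
                    }
                }
            }
        }
    }
}
```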

Related

Do we need a cache layer for Cassandra?

I am playing with/evaluating a real-time streaming application with millions of user activity logs. The design was to use Cassandra as the persistent store and Redis as a cache layer holding the most recent activities (the last 1000). I am looking for advice on whether such a cache layer is necessary alongside Cassandra: can Cassandra alone deliver the required read and write performance? The activities are streamed to the front end as pages of 10 or 15 records. Suggestions for alternative NoSQL solutions are welcome as well.
It depends a lot on your requirements - Cassandra is reasonably fast for most common purposes, but Redis will be faster, so having a caching layer is a reasonable and common approach. It's not strictly necessary, but it's not a bad idea.
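As a rough illustration of that cache layer (not a recommendation of a specific client), here is a sketch using the Jedis client, where the last 1000 activities per user are kept in a Redis list and served in pages while Cassandra remains the durable store; the activity:<userId> key scheme and JSON payloads are assumptions:

```java
import redis.clients.jedis.Jedis;
import java.util.List;

public class RecentActivityCache {
    private static final int MAX_RECENT = 1000;
    private final Jedis jedis = new Jedis("localhost", 6379);

    public void record(String userId, String activityJson) {
        String key = "activity:" + userId;
        jedis.lpush(key, activityJson);      // newest first
        jedis.ltrim(key, 0, MAX_RECENT - 1); // cap the list at the last 1000 entries
        // ... also write the activity to Cassandra for durability (not shown)
    }

    public List<String> page(String userId, int pageNumber, int pageSize) {
        long start = (long) pageNumber * pageSize;
        return jedis.lrange("activity:" + userId, start, start + pageSize - 1);
    }
}
```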

Need architecture hint: Data replication into the cloud + data cleansing

I need to sync customer data from several on-premise databases into the cloud. In a second step, the customer data there needs some cleanup in order to remove duplicates (of different types). Based on that cleansed data I need to do some data analytics.
To achieve this goal, I'm searching for an open source framework or cloud solution I can use for this. I took a look at Apache Apex and Apache Kafka, but I'm not sure whether these are the right solutions.
Can you give me a hint as to which frameworks you would use for such a task?
From my quick read on Apex, it requires Hadoop underneath, which couples you to more dependencies than you probably want early on.
Kafka, on the other hand, is used for transmitting messages (it has other APIs, such as Streams and Connect, which I'm not as familiar with).
I'm currently using Kafka to stream log files in real time from a client system. Out of the box, Kafka really only provides fire-and-forget semantics; I had to add a bit of my own code to get exactly-once delivery semantics (Kafka 0.11.0 should solve this - see the sketch after this answer).
Overall, think of Kafka as a lower-level solution with logical message domains and queues, and, from what I skimmed, of Apex as a more heavily packaged library with a lot more things to explore.
Kafka would allow you to swap in the underlying analytical system of your choosing via its consumer API.
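On the exactly-once point above: since Kafka 0.11 the producer can be made idempotent and transactional through configuration alone. A hedged sketch follows; the broker address, topic name, transactional id and sample record are placeholders:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ExactlyOnceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");        // no duplicates on retry
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "log-shipper-1"); // enables atomic multi-record sends

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("logs", "host-1", "2017-08-01 12:00:00 INFO started"));
            producer.commitTransaction();
        }
    }
}
```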
The question is very generic, but I'll try to outline a few different scenarios, as there are many parameters in play here. One of them is cost, which on the cloud can quickly build up. Of course, the size of the data is also important.
These are a few things you should consider:
1) Batch vs streaming: do the updates flow continuously, or is the process run on demand/periodically? (It sounds like the latter rather than the former.)
2) What latency is required? That is, what is the maximum time an update may take to propagate through the system? The answer to this question influences question 1).
3) How much data are we talking about? Are you in the GByte, TByte or PByte range? Different tools have different 'maximum altitude'.
4) And in what format? Do you have text files, or are you pulling from relational DBs?
5) Cleaning and deduping can be tricky in plain SQL. What language/tools are you planning to use for that part? Depending on the data size (question 3), deduping usually requires a join by ID, which is done in constant time in a key-value store but requires a sort (generally O(n log n)) in most other data systems (Spark, Hadoop, etc.); see the Spark sketch after this answer.
So, while you ponder all these questions, if you're not sure, I'd recommend you start your cloud work with an elastic, pay-as-you-go solution rather than setting up entire clusters on the cloud, which could quickly become expensive.
One cloud solution that you could quickly fire up is Amazon Athena (https://aws.amazon.com/athena/). You can dump your data in S3, where it's read by Athena, and you just pay per query, so you don't pay when you're not using it. It is based on Apache Presto, so you could write the whole system in basically SQL.
Otherwise you could use Elastic MapReduce with Hive (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html) or Spark (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html). It depends on what language/technology you're most comfortable with. There are also similar products from Google (BigQuery, etc.) and Microsoft (Azure).
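To illustrate point 5) about deduplication, here is a hedged Spark sketch using the Java API; the customer_id column and the S3 paths are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DedupCustomers {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("dedup-customers").getOrCreate();

        Dataset<Row> customers = spark.read().json("s3://my-bucket/raw/customers/");
        // Deduplication by business key; this triggers a shuffle (roughly O(n log n)),
        // which is the cost discussed in point 5) above.
        Dataset<Row> cleansed = customers.dropDuplicates("customer_id");

        cleansed.write().mode("overwrite").parquet("s3://my-bucket/cleansed/customers/");
        spark.stop();
    }
}
```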
Yes, you can use Apache Apex for your use case. Apache Apex comes with Apache Malhar, which can help you build an application quickly: load data using the JDBC input operator, then either store it to your cloud storage (maybe S3) or de-duplicate it before storing it to any sink. Malhar also provides a Dedup operator for this kind of operation. But, as mentioned in the previous reply, Apex does need Hadoop underneath to function.

DB access from a Mapper in MapReduce

I'm planning the next generation of an analysis system I'm developing, and I'm thinking of implementing it on one of the MapReduce/stream-processing platforms like Flink, Spark Streaming, etc.
For the analysis, the mappers must have DB access.
So my greatest concern is that when the mappers run in parallel, all the connections in the connection pool will be in use, and a mapper might fail to access the DB.
How should I handle that?
Is this something I need to be concerned about?
As you have pointed out, a pull-style strategy is going to be inefficient and/or complex.
Your strategy for ingesting the meta-data from the DB will be dictated by the amount of meta-data and the frequency that the meta-data changes. Either way, moving away from fetching the meta-data when it's needed, and toward receiving updates when the meta-data is changed, is likely to be a good approach.
Some ideas:
Periodically dump the meta-data to flat file/s into distributed file system
Streaming meta-data updates into your pipeline at write-time to keep an in-memory cache up-to-date (a per-JVM cache sketch follows this answer)
Use a separate mechanism to fetch the meta-data, for instance Akka Actor/s polling for changes
It will depend on the trade-offs you are able to make for your given use-case.
If DB interactivity is unavoidable, I do wonder whether MapReduce-style frameworks are the best approach to your problem. But any failed tasks should be retried by the framework.
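As a hedged sketch of the in-memory cache idea from the list above: a single, periodically refreshed metadata cache per worker JVM means mapper tasks never compete for pooled DB connections. The JDBC URL, query and refresh interval below are assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class MetadataCache {
    private static final MetadataCache INSTANCE = new MetadataCache();
    private final Map<String, String> metadata = new ConcurrentHashMap<>();

    private MetadataCache() {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "metadata-refresh");
                    t.setDaemon(true);
                    return t;
                });
        scheduler.scheduleAtFixedRate(this::refresh, 0, 60, TimeUnit.SECONDS);
    }

    public static MetadataCache get() {
        return INSTANCE;
    }

    public String lookup(String key) {
        return metadata.get(key);
    }

    private void refresh() {
        // One short-lived connection per refresh, shared by every task in this JVM.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://db-host/meta");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT key, value FROM mapper_metadata")) {
            while (rs.next()) {
                metadata.put(rs.getString("key"), rs.getString("value"));
            }
        } catch (Exception e) {
            // Keep serving the last good snapshot if a refresh fails.
        }
    }
}
```

A map function would then call MetadataCache.get().lookup(id) instead of opening its own connection.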

Streaming Big Data - where to store intermediate results?

I am working on a Spark Streaming job that needs to store intermediate results in order to reuse them in the next window. The amount of data is extremely large, so there is probably no way to store it in the Spark cache. What is more, I need some way to read the data back by 'key'.
I was thinking about Cassandra as intermediate storage, but it also has some drawbacks.
Alternatively, maybe Kafka would do the job, but it would require additional work to select a given portion of the data by key.
Could you advise me on what I should do?
How are such problems resolved in Storm - is there an internal mechanism, or is it preferred to use external tools?
Solr as index + Cassandra as NoSQL storage is working fine for my use case, where I have to process terabytes of data. But in my case, I am using Cassandra for persistent storage of years of data.
Kafka is working fine as a replacement for JBoss/AMQ due to its simple architecture. Currently I am working with Apache Storm + Kafka for real-time stream processing in one of my projects.
Since you are storing intermediate data, I think Kafka is the best choice, provided you set the right retention period.
Have a look at this other SE question and this other article.
As you mention, Kafka has some problems getting items by key; it really only provides APIs for a FIFO paradigm. I would advise using dedicated storage software - Cassandra, MongoDB, I have even seen Solr used to store text (a Cassandra-based sketch follows this answer). It would be easier to use something designed for key retrieval than to try to modify Kafka yourself and most likely introduce bugs/issues that could take forever to solve.
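As a hedged sketch of that "dedicated storage" suggestion, here is what keyed intermediate state in Cassandra could look like with the DataStax Java driver 4.x; the streaming.window_state table and String payloads are assumptions:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.Row;

public class IntermediateResultStore implements AutoCloseable {
    // Contact points default to localhost or come from the driver's application.conf.
    private final CqlSession session = CqlSession.builder().build();
    private final PreparedStatement upsert =
            session.prepare("INSERT INTO streaming.window_state (key, payload) VALUES (?, ?)");
    private final PreparedStatement read =
            session.prepare("SELECT payload FROM streaming.window_state WHERE key = ?");

    public void put(String key, String payload) {
        session.execute(upsert.bind(key, payload));
    }

    public String get(String key) {
        Row row = session.execute(read.bind(key)).one();
        return row == null ? null : row.getString("payload");
    }

    @Override
    public void close() {
        session.close();
    }
}
```

A TTL on the inserts (USING TTL in the CQL) is a simple way to expire intermediate state once it is no longer needed.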
As SQL.injection said, you'll have to manage the storage and logic by yourself. Storm doesn't offer such a mechanism.

Cassandra as an embedded service and with custom consistency level

I am thinking of building an application that uses Cassandra as its data store but has low-latency requirements. I am aware of EmbeddedCassandraService from this blog post.
Is the following implementation possible and what are known pitfalls (defects, functional limitations)?
1) Run Cassandra as an embedded service, persisting data to disk (durable).
2) Java application interacts with the local embedded service via one of the following. What are the pros and cons of each?
TMemoryBuffer (or something more appropriate?)
StorageProxy (what are the pitfalls of using this API?)
Apache Avro? (see question #5 below)
3) Java application interacts with remote Cassandra service ("backup" nodes) via Thrift (or Avro?).
4) Write must always succeed to the local embedded Cassandra service in order to be successful, and at least one of the remote (non-embedded) Cassandra nodes. Is this possible? Is it possible to define a custom / complex consistency level?
5) Side question: Cassandra: The Definitive Guide mentions in several places that Thrift will ultimately be replaced by Avro, but it seems that's not the case just yet?
As you might guess, I am new to Cassandra, so any direction to specific documentation pages (not the wiki homepage) or sample projects are appreciated.
Unless your entire database is sitting on the local machine (i.e. a single node), you gain nothing by this configuration. Cassandra will shard your data across the cluster, so (as mentioned in one of the comments) your writes will frequently be made to another node that owns the data. Presuming you write with a consistency level of at least one, your call will block until that other node acks the write. This negates any benefit of talking to the embedded instance since you have some network latency anyway.
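For reference, and only as a hedged sketch using the modern DataStax Java driver 4.x (not the Thrift/Avro APIs discussed in the question): consistency is chosen per statement from Cassandra's predefined levels (ONE, QUORUM, LOCAL_QUORUM, ALL, ...); there is no hook for an arbitrary custom level such as "the local node plus exactly one remote". The demo.kv table is a placeholder:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ConsistencyExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            SimpleStatement write = SimpleStatement
                    .newInstance("INSERT INTO demo.kv (id, value) VALUES (?, ?)", "k1", "v1")
                    .setConsistencyLevel(DefaultConsistencyLevel.QUORUM); // blocks until a majority of replicas ack
            session.execute(write);
        }
    }
}
```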

Resources