Cassandra vs Druid - cassandra

I have a use case where I need to analyze real-time data using Apache Spark, but I am still unsure which data store to choose for my application. The analysis mostly involves aggregation, KPI-based identity analysis, and machine learning tools to predict trends. Cassandra has good support, and large tech companies already use it in production. But after some research I found that Druid is faster than Cassandra and is good for OLAP queries, although its results are inconsistent for queries like Count Distinct.
Any help with this would be appreciated. Thanks.

Since your use case is real-time analysis, I would suggest Druid rather than Apache Cassandra. Because Cassandra uses asynchronous masterless replication, a real-time analysis can miss recently updated data. Druid, on the other hand, is designed for real-time analytics.
Druid Details: http://druid.io/druid.html
Apache Cassandra Details: https://en.wikipedia.org/wiki/Apache_Cassandra
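On the Count Distinct point from the question: as far as I know, Druid answers COUNT(DISTINCT) approximately by default, using sketches such as HyperLogLog, which is why the numbers can differ slightly from an exact count. The snippet below is a minimal sketch of that idea using a simpler k-minimum-values estimator. It is not Druid's implementation, and hash01, approx_count_distinct, and the k parameter are names made up for illustration.

```python
import hashlib
import heapq


def hash01(value: str) -> float:
    """Map a value to a deterministic pseudo-random number in [0, 1)."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_count_distinct(values, k: int = 1024) -> int:
    """Estimate the number of distinct values while remembering only k hashes.

    Illustrative k-minimum-values estimator (an assumption for this example,
    not what Druid runs internally).
    """
    heap = []        # max-heap (stored as negatives) of the k smallest hashes
    members = set()  # hashes currently held in the heap
    for value in values:
        h = hash01(value)
        if h in members:
            continue  # duplicate of a value we are already tracking
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values seen: count is exact
    # The k-th smallest of n uniform hashes sits near k / n, so n ~ (k - 1) / kth.
    return int((k - 1) / -heap[0])


if __name__ == "__main__":
    events = [f"user-{i % 50000}" for i in range(200000)]
    print("exact: ", len(set(events)))
    print("approx:", approx_count_distinct(events))
```

The estimate is close to the exact count but rarely identical, and a larger k trades memory for accuracy. If exact numbers matter, that is a trade-off to check in whichever store you pick.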

Related

Distributed Data Store - Hazelcast Vs Cassandra

We need to choose between Hazelcast and Cassandra as a distributed data store. I have worked with Cassandra but not with Hazelcast, and would like a comparative analysis of features such as:
Replication
Scalability
Availability
Data Distribution
Performance of reads/writes
Consistency
I would appreciate some help in making the right choice.
The following page and the documents on it might help with your decision: https://hazelcast.com/use-cases/nosql/apache-cassandra-replacement/
https://db-engines.com/en/system/Cassandra%3BHazelcast

Spark goodness with Cassandra?

I've been reading about Apache Cassandra lately to learn how it works and how to use it for IoT projects, especially where a time-series database is needed.
However, I started to notice that Apache Spark is often mentioned when people talk about Cassandra too.
The question is: as long as I can use a Cassandra cluster to serve my app, storing and reading data, why would I need Apache Spark? Any useful use cases are appreciated!
The answer is broad, but to summarize: Cassandra is highly scalable and fits a lot of scenarios, but CQL syntax has some limitations if your schema was not designed for the queries you need.
If you want to use your data without those restrictions, run analytical workloads on your Cassandra data, or join it with other tables, Spark is the most appropriate complement. Spark has a tight integration with Cassandra.
I recommend checking these slides: http://www.slideshare.net/patrickmcfadin/apache-cassandra-and-spark-you-got-the-the-lighter-lets-start-the-fire?qid=48e2528c-a03c-49b4-879e-45599b2aff34&v=&b=&from_search=5
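To make the schema limitation concrete, here is a toy model in plain Python (no Cassandra or Spark involved). ToyTable and its methods are hypothetical names for illustration only. Rows are grouped by partition key, so a lookup by that key touches one partition, while a filter on any other column has to read everything, which is the kind of work an analytics engine is meant to take over.

```python
from collections import defaultdict


class ToyTable:
    """Rows grouped by partition key, loosely mimicking a Cassandra table layout."""

    def __init__(self, partition_key: str):
        self.partition_key = partition_key
        self.partitions = defaultdict(list)

    def insert(self, row: dict) -> None:
        self.partitions[row[self.partition_key]].append(row)

    def by_partition(self, key):
        # Cheap: only the one partition that owns this key is read.
        return list(self.partitions.get(key, []))

    def scan(self, predicate):
        # Expensive: every partition is read, because the table was not
        # modeled around this predicate.
        return [
            row
            for rows in self.partitions.values()
            for row in rows
            if predicate(row)
        ]


if __name__ == "__main__":
    readings = ToyTable(partition_key="sensor_id")
    readings.insert({"sensor_id": "s1", "temp": 21.5})
    readings.insert({"sensor_id": "s1", "temp": 35.0})
    readings.insert({"sensor_id": "s2", "temp": 31.0})

    print(readings.by_partition("s1"))            # query the schema was built for
    print(readings.scan(lambda r: r["temp"] > 30))  # query it was not built for
```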
Cassandra is for storing data, whereas Spark is for performing computation on top of it. An analogy with Hadoop: Cassandra is like HDFS, whereas Spark is like MapReduce.
With computations especially, data locality can be exploited when using the DataStax Cassandra connector. If you need a computation that modifies a row (but doesn't depend on anything else), that operation is optimized to run locally on each machine in the cluster without moving data across the network.
The same goes for a lot of other Spark workloads: the operations (functions applied to the data) run locally and only the result is sent to the client. As far as I know, when you want to do analytics on top of data stored in Cassandra, Spark is a well-supported and popular choice. Even if you don't need to do any operations on the data, you can still use Spark for other purposes, as described below.
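The locality idea can be shown with a toy simulation in plain Python. This is only a sketch under assumptions: it does not use Spark or the connector, and owner_node, distribute, local_averages, and cluster_averages are invented names. Rows are placed on a "node" by partition key, each node aggregates only the rows it stores, and only the small per-node results travel to the client.

```python
import zlib
from collections import defaultdict


def owner_node(partition_key: str, node_count: int) -> int:
    """Deterministically map a partition key to a node (stand-in for a token ring)."""
    return zlib.crc32(partition_key.encode("utf-8")) % node_count


def distribute(rows, node_count: int):
    """Place each row on the node that owns its partition key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[owner_node(row["sensor_id"], node_count)].append(row)
    return nodes


def local_averages(rows):
    """Runs on one node: aggregate only the rows stored there."""
    sums = defaultdict(lambda: [0.0, 0])  # sensor_id -> [total, count]
    for row in rows:
        acc = sums[row["sensor_id"]]
        acc[0] += row["temp"]
        acc[1] += 1
    return {sensor: total / count for sensor, (total, count) in sums.items()}


def cluster_averages(rows, node_count: int = 3):
    """Client side: merge the small per-node results; raw rows never move."""
    result = {}
    for node_rows in distribute(rows, node_count).values():
        result.update(local_averages(node_rows))
    return result


if __name__ == "__main__":
    data = [
        {"sensor_id": "a", "temp": 10.0},
        {"sensor_id": "a", "temp": 20.0},
        {"sensor_id": "b", "temp": 5.0},
    ]
    print(cluster_averages(data))
```

Because all rows for one partition key live on the same node, the per-sensor average can be finished locally and the merge step is a plain dictionary update.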
Spark Streaming can be used to ingest data into or export data from Cassandra (I have used it a lot personally). The same import/export can be achieved with small hand-written JDBC agents, but the Spark Streaming code I wrote for ingesting 10 GB of data from Cassandra is less than 20 lines long, with multi-machine, multi-threaded execution built in and an admin UI where I can see the job progress.
With Spark + Zeppelin we can visualize Cassandra data: we can build beautiful UIs with little Spark code, where users can even enter input and see the result as a graph, table, etc.
Note: visualization can actually be better with Kibana/Elasticsearch or Solr/Banana when used with Cassandra, but they are very hard to set up, and indexing has its own issues to deal with.
There are a lot of other use cases, but personally I used Spark as a Swiss army knife for multiple tasks.
Apache Cassandra offers fast reads and writes, so you can use it with Apache Spark Streaming to write your data directly into Cassandra with little latency.
As an example use case, consider a video application that uploads video via streaming and stores it directly in a Cassandra blob column.

Change Capture from DB2 to Cassandra

I am trying to get all inserts, updates, deletes to a normalized DB2 database (hosted on an IBM Mainframe) synced to a Cassandra database. I also need to denormalize these changes before I write them to Cassandra so that the data structure meets my Cassandra model.
I searched on Google, but the tools I found lack either processing support or streaming CDC support.
Is there any tool out there that can help me achieve the above?
It's likely that no stock tool exists. What's the format of the CDC stream coming out? What queries do you need to run? Like any other Cassandra data modeling question, start with the queries you need to run and work backwards to the table structure(s).
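To illustrate working backwards from the query, here is a minimal sketch in plain Python. It rests on assumptions, since the real CDC format was not given: each change is taken to arrive as an operation, a source table name, a key, and a row, and the single query to serve is "orders for a customer, with the customer name". The Change and Denormalizer names, the customer and orders tables, and their columns are all hypothetical. A real pipeline would issue Cassandra writes where this code updates the view dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Change:
    op: str                     # "insert", "update", or "delete"
    table: str                  # hypothetical source table: "customer" or "orders"
    key: int                    # primary key in the source table
    row: Optional[dict] = None  # full row image; None for deletes


class Denormalizer:
    """Keeps a query-shaped view in sync with normalized change events."""

    def __init__(self):
        self.customers = {}  # customer_id -> row
        self.orders = {}     # order_id -> row
        self.view = {}       # (customer_id, order_id) -> denormalized row

    def apply(self, change: Change) -> None:
        if change.table == "customer":
            self._apply_customer(change)
        elif change.table == "orders":
            self._apply_order(change)
        else:
            raise ValueError(f"unknown table: {change.table}")

    def _apply_order(self, change: Change) -> None:
        # Drop the old view row first, in case the order moved to another customer.
        old = self.orders.pop(change.key, None)
        if old is not None:
            self.view.pop((old["customer_id"], change.key), None)
        if change.op != "delete":
            self.orders[change.key] = change.row
            self._write_view_row(change.key)

    def _apply_customer(self, change: Change) -> None:
        if change.op == "delete":
            self.customers.pop(change.key, None)
        else:
            self.customers[change.key] = change.row
        # A customer change fans out to every denormalized row that embeds it.
        for order_id, order in self.orders.items():
            if order["customer_id"] == change.key:
                self._write_view_row(order_id)

    def _write_view_row(self, order_id: int) -> None:
        order = self.orders[order_id]
        customer = self.customers.get(order["customer_id"], {})
        self.view[(order["customer_id"], order_id)] = {
            "customer_name": customer.get("name"),
            "total": order["total"],
        }


if __name__ == "__main__":
    d = Denormalizer()
    d.apply(Change("insert", "customer", 1, {"name": "Ada"}))
    d.apply(Change("insert", "orders", 10, {"customer_id": 1, "total": 25.0}))
    d.apply(Change("update", "customer", 1, {"name": "Ada L."}))
    print(d.view)
```

The point of the sketch is the fan-out: one update on a normalized table can touch many denormalized rows, so whatever tool you pick needs both the change stream and some state to look up the related rows.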

Data analytics on Cassandra

We are using Apache Cassandra to store our data. Apart from Spark, what tools or technologies can perform data analytics on data read from Cassandra? Spark is good, but it needs a programmer (Java/Scala/Python) to add or modify code for future requirements, which leads to a high maintenance cost. What are the alternatives?
If you want to go with Spark on top of Cassandra, many have accomplished good results with Cassandra, Hive, and Hadoop. Others have accomplished similar results using a mix of Cassandra, Hive, and Solr.
Another decent set of slides and tutorial for running analysis of data via Cassandra and Hadoop. You will find more in depth explanation of this via the PDF download on the provided page.
If you're interested in continuing to pursue Spark, you can evaluate DataStax Enterprise, which took the complexity out of it and allows you to run Spark right on top of Cassandra.
To answer your question, you have a few industry proven options... Primarily Hadoop and Hive.

What are the fundamental architectural, SQL compliance, and data use scenario differences between Presto and Impala?

Can some experts give some succinct answers to the differences between Presto and Impala from these perspectives?
Fundamental architecture design
SQL compliance
Real-world latency
Any SPOF or fault-tolerance functionality
Structured and unstructured data use scenario performance
Apache Impala is a query engine for HDFS/Hive systems only.
PrestoDB and the community version Trino, on the other hand, are generic query engines that support HDFS as just one of many data sources. A long list of connectors is available, and Hive/HDFS is only one of them. This also means that you can query different data sources in the same system, at the same time.

Resources