Cassandra with Apache Spark - apache-spark

I want to run some analytical queries, as well as some range queries, on time series data stored in Cassandra. For this, I came across Apache Spark, which supports all of these. Could anyone point me to good tutorials or resources on how to integrate Apache Spark with Cassandra and run queries?
I am familiar with Java/J2EE, SQL and CQL, but not with Scala. Would it be worth learning Scala for this POC? I am using Cassandra 2.2.

Before reading theory, spend time understanding the architecture.
There are good videos on YouTube.
Once you understand the architecture, get familiar with the basics at Tutorials Point Cassandra, Tutorials Point Spark, and Apache Spark and Apache Cassandra.
Then you can go through the DataStax Cassandra tutorial to learn the concepts in depth, and proceed with the integration of Spark and Cassandra.

Here you go:
https://academy.datastax.com/courses/getting-started-apache-spark
You could have googled it.
Anyway, the tutorial contains a very good explanation of Spark and shows how Spark works with Cassandra.
Hope you learn what you are looking for. :)
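To make the integration concrete, here is a minimal PySpark sketch of a time-range query over a Cassandra table read through the DataStax spark-cassandra-connector. Python is used only to keep the example short; the same connector is usable from Java and Scala. Everything specific is an assumption for illustration: the keyspace `metrics`, the table `sensor_readings`, and the columns `sensor_id`, `event_time` and `value` are hypothetical, the host is a placeholder, and the code assumes Spark 2.x (`SparkSession`) with the connector package on the classpath. Whether the filter is pushed down to Cassandra depends on the table's primary key and the connector version.

```python
from datetime import datetime

# Data source name registered by the DataStax spark-cassandra-connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def time_range_filter(column, start, end):
    """Build a Spark SQL filter string for the half-open range [start, end)."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return "{c} >= '{s}' AND {c} < '{e}'".format(
        c=column, s=start.strftime(fmt), e=end.strftime(fmt))


def load_readings(spark, start, end):
    """Read the hypothetical metrics.sensor_readings table, keeping one time window."""
    df = (spark.read
          .format(CASSANDRA_FORMAT)
          .options(keyspace="metrics", table="sensor_readings")
          .load())
    return df.filter(time_range_filter("event_time", start, end))


def build_session(host="127.0.0.1"):
    """Create a SparkSession pointed at a Cassandra contact point (placeholder host)."""
    # Imported lazily so the helpers above can be loaded without pyspark installed.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .appName("cassandra-range-query")
            .config("spark.cassandra.connection.host", host)
            .getOrCreate())


def daily_average(spark):
    """Average value per sensor for one (arbitrary) day."""
    window = load_readings(spark, datetime(2016, 1, 1), datetime(2016, 1, 2))
    return window.groupBy("sensor_id").avg("value")
```

In a job launched with `spark-submit` and the connector package, calling `daily_average(build_session()).show()` would print the per-sensor averages, assuming a table with that shape exists.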

Related

What exactly is the need for Spark when using Talend?

I am new to both Spark and Talend.
I read everywhere that both of these are ETL tools. I read another Stack Overflow answer here. What I understood from that answer is that Talend uses Spark for large-scale data processing. But can Talend do all the ETL work that Spark does, just as efficiently, without using Spark under the hood? Or is it essentially a wrapper over Spark, where all the data sent to Talend is actually processed by a Spark engine inside Talend?
I am quite confused about this. Can someone clarify?
Unlike Informatica BDM, which has its own native Blaze framework for processing on Hadoop, Talend relies on other frameworks such as MapReduce (on Hadoop, possibly with Tez underneath) or the Spark engine. So you could avoid Spark, but there is little point in doing so. The key point is that you can expect some productivity gain from Talend because it is graphical, which is handy when there are many fields, and you do not necessarily need the most skilled staff.
For NoSQL stores like HBase, they provide specific connectors, or you could go the Phoenix route. Talend also has connectors for Kafka.
Spark is just one of the frameworks supported by Talend. When you create a new job, you can pick Spark from the dropdown list. You can get more details in the docs.

Data visualization tools for Apache Cassandra

What are some free tools for Apache Cassandra data visualization with drag-and-drop functionality?
I have worked with Apache Zeppelin and Kibana. Any suggestions other than those, please?
Thanks!
You can try connecting Tableau (a well-known data visualization tool) to Cassandra. This needs an intermediate layer, the Spark Thrift Server, so you need a Spark cluster. The whole setup requires a bit of effort but is doable. You can easily find tutorials and blog posts on the internet that explain how to achieve this.

Cassandra vs Druid

I have a use case where I need to analyze real-time data using Apache Spark, but I am still unsure which data store to choose for my application. The analysis mostly includes aggregation, KPI-based identity analysis, and machine learning to predict trends. Cassandra has good support, and large tech companies already use it in production. After some research, however, I found that Druid is faster than Cassandra and good for OLAP queries, but its results can be inconsistent for queries like Count Distinct.
Any help would be appreciated. Thanks.
Since your use case is analyzing real-time data, I would suggest Druid rather than Apache Cassandra. With Apache Cassandra, because of its asynchronous masterless replication, a real-time analysis may miss recently updated data. Druid, on the other hand, is designed for real-time analysis.
Druid Details: http://druid.io/druid.html
Apache Cassandra Details: https://en.wikipedia.org/wiki/Apache_Cassandra

Spark goodness with Cassandra?

I've been reading about Apache Cassandra lately to learn how it works and how to use it for IoT projects, especially where a time-series database is needed.
However, I started to notice that Apache Spark is often mentioned when people talk about Cassandra.
My question is: as long as I can use a Cassandra cluster to serve my app, storing and reading data, why would I need Apache Spark? Any useful use cases are appreciated!
The answer is broad, but to summarize: Cassandra is highly scalable and fits a lot of scenarios, but CQL syntax has some limitations if your schema was not designed for the queries you want to run.
If you want to use your data without those restrictions, run analytical workloads on your Cassandra data, or join it with other tables, Spark is the most appropriate complement. Spark has tight integration with Cassandra.
I recommend checking these slides: http://www.slideshare.net/patrickmcfadin/apache-cassandra-and-spark-you-got-the-the-lighter-lets-start-the-fire?qid=48e2528c-a03c-49b4-879e-45599b2aff34&v=&b=&from_search=5
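As an illustration of the kind of query CQL does not offer, here is a hedged PySpark sketch that joins two Cassandra tables and aggregates the result through Spark SQL. The schema is hypothetical (`metrics.sensor_readings(sensor_id, event_time, value)` and `metrics.sensors(sensor_id, location)`), and the code assumes Spark 2.x with the DataStax spark-cassandra-connector on the classpath and an already configured `SparkSession` passed in as `spark`.

```python
# Data source name registered by the DataStax spark-cassandra-connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"

# A join plus GROUP BY across two tables, expressed in Spark SQL.
# Table and column names are hypothetical.
AVG_BY_LOCATION = """
SELECT s.location, AVG(r.value) AS avg_value
FROM readings r
JOIN sensors s ON r.sensor_id = s.sensor_id
GROUP BY s.location
"""


def register_view(spark, keyspace, table, view_name):
    """Expose a Cassandra table to Spark SQL under the name view_name."""
    df = (spark.read
          .format(CASSANDRA_FORMAT)
          .options(keyspace=keyspace, table=table)
          .load())
    df.createOrReplaceTempView(view_name)
    return df


def average_by_location(spark):
    """Register both tables as temp views, then run the join in Spark."""
    register_view(spark, "metrics", "sensor_readings", "readings")
    register_view(spark, "metrics", "sensors", "sensors")
    return spark.sql(AVG_BY_LOCATION)
```

The join and the aggregation run in Spark, not in Cassandra, so the Cassandra schema does not have to be modeled around this particular query.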
Cassandra is for storing data, whereas Spark is for performing computation on top of it. An analogy with Hadoop: Cassandra is like HDFS, whereas Spark is like MapReduce.
For computations in particular, the DataStax Cassandra connector can exploit data locality. If you need a computation that modifies a row but does not depend on anything else, that operation is optimized to run locally on each machine in the cluster, without moving data across the network.
The same goes for a lot of other Spark workloads: the transformations (functions that modify the data) run locally, and only the result is sent to the client. As far as I know, when you want to do analytics on top of data stored in Cassandra, Spark is a well-supported and popular choice. Even if you don't need to run any operations on the data, you can still use Spark for other purposes, as I mention below.
Spark Streaming can be used to ingest data into or export data from Cassandra (I have used it a lot personally). The same import/export can be achieved with small hand-written JDBC agents, but the Spark Streaming code I wrote for ingesting 10 GB of data from Cassandra is less than 20 lines long, with multi-machine, multi-threaded execution built in and an admin UI where I can see the job progress.
With Spark and Zeppelin, we can visualize Cassandra data: with little Spark code we can build beautiful UIs where users can even enter input and see the result as a graph, table, and so on.
Note: visualization can actually be better with Kibana/Elasticsearch or Solr/Banana when used with Cassandra, but they are very hard to set up, and indexing has its own issues to deal with.
There are a lot of other use cases; personally, I have used Spark as a Swiss army knife for multiple tasks.
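To give a sense of how short such import/export code can be, here is a sketch that uses the batch DataFrame API rather than Spark Streaming, so it is not the code described above. It assumes Spark 2.x with the DataStax spark-cassandra-connector on the classpath and a configured `SparkSession` passed in as `spark`; the keyspace, table and paths are supplied by the caller and are hypothetical.

```python
# Data source name registered by the DataStax spark-cassandra-connector.
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def export_table(spark, keyspace, table, out_path):
    """Dump a Cassandra table to Parquet files under out_path."""
    df = (spark.read
          .format(CASSANDRA_FORMAT)
          .options(keyspace=keyspace, table=table)
          .load())
    # Overwrites anything already at out_path.
    df.write.mode("overwrite").parquet(out_path)


def ingest_csv(spark, csv_path, keyspace, table):
    """Append rows from a CSV file with a header row to an existing Cassandra table."""
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(csv_path))
    # The target table must already exist and its columns must match the CSV header.
    (df.write
       .format(CASSANDRA_FORMAT)
       .options(keyspace=keyspace, table=table)
       .mode("append")
       .save())
```

Spark splits the read and the write across its executors, which is where the built-in multi-machine parallelism comes from; the job's progress is visible in the Spark web UI.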
Apache Cassandra offers fast reads and writes, so you can use it with Apache Spark Streaming to write your data directly into Cassandra without legacy.
As a use case, consider a video application that uploads video via streaming and stores it directly in a Cassandra blob.

Data analytics on Cassandra

We are using Apache Cassandra to store our data. Apart from Spark, what tools or technologies can perform data analytics after reading data from Cassandra? Spark is good, but it needs a programmer (Java/Scala/Python) to add or modify code for future requirements, which leads to high maintenance cost. What are the alternatives?
If you want to go with Spark on top of Cassandra, many have achieved good results with Cassandra, Hive, and Hadoop. Others have achieved similar results using a mix of Cassandra, Hive, and Solr.
Here is another decent set of slides and a tutorial on running analysis of data via Cassandra and Hadoop. You will find a more in-depth explanation in the PDF download on the provided page.
If you're interested in continuing with Spark, you can evaluate DataStax Enterprise, which takes the complexity out of it and allows you to run Spark right on top of Cassandra.
To answer your question, you have a few industry-proven options, primarily Hadoop and Hive.

Resources