Spark goodness with Cassandra? - apache-spark

I've been reading about Apache Cassandra lately to learn how it works and how to use it for IoT projects, especially where a time-series-oriented database is needed.
However, I started to notice that Apache Spark is often mentioned when people talk about Cassandra too.
The question is: as long as I can use a Cassandra cluster of nodes to serve my app, storing and reading data, why would I need Apache Spark? Any useful use cases are appreciated!

The answer is broad, but to summarize: Cassandra is highly scalable and there are a lot of scenarios where it fits, but CQL syntax has some limitations if your schema wasn't designed up front for the queries you need.
If you want to make use of your data without those restrictions, run analytical workloads over your Cassandra data, or join it with other tables, Spark is the most appropriate complement. Spark has tight integration with Cassandra.
I recommend checking these slides: http://www.slideshare.net/patrickmcfadin/apache-cassandra-and-spark-you-got-the-the-lighter-lets-start-the-fire?qid=48e2528c-a03c-49b4-879e-45599b2aff34&v=&b=&from_search=5
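For a concrete feel of that complement, here is a minimal sketch (Scala, using the DataStax spark-cassandra-connector) of an aggregation that would be awkward in plain CQL without a purpose-built table; the keyspace, table, and column names are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object SensorRollup {
      def main(args: Array[String]): Unit = {
        // Assumes the spark-cassandra-connector is on the classpath and
        // the connection host points at your cluster.
        val spark = SparkSession.builder()
          .appName("cassandra-analytics-sketch")
          .config("spark.cassandra.connection.host", "127.0.0.1")
          .getOrCreate()

        // Hypothetical keyspace/table holding raw IoT readings.
        val readings = spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "iot", "table" -> "sensor_readings"))
          .load()

        // An ad-hoc rollup CQL cannot express unless the schema was
        // modelled for it: average value per sensor per day.
        val daily = readings
          .groupBy(col("sensor_id"), to_date(col("event_time")).as("day"))
          .agg(avg("value").as("avg_value"))

        daily.show()
        spark.stop()
      }
    }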

Cassandra is for storing data, whereas Spark is for performing computation on top of it. By analogy with Hadoop: Cassandra is like HDFS, whereas Spark is like MapReduce.
Especially with computations, data locality can be exploited when using the DataStax Cassandra connector. If you need to do a computation that modifies a row (but doesn't really depend on anything else), that operation is optimized to run locally on each machine in the cluster without any data movement over the network.
The same goes for a lot of other Spark workloads: the actions (functions that operate on the data) are executed locally and only the result is sent to the client. As far as I know, when you want to do analytics on top of data stored in Cassandra, Spark is a well-supported and popular choice. Even if you don't need to do any processing on the data, you can still use Spark for other purposes, as I mention below.
Spark Streaming can be used to ingest data into or export data from Cassandra (I have used it a lot personally). The same import/export can be achieved with small hand-written JDBC agents, but the Spark Streaming code I wrote for ingesting 10 GB of data from Cassandra is less than 20 lines, with multi-machine, multi-threaded execution built in and an admin UI where I can see the job's progress.
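As a rough illustration of how small such an export job can be, here is a sketch (again assuming the spark-cassandra-connector is on the classpath; keyspace, table, and output path are placeholders) that dumps a Cassandra table to Parquet:

    import org.apache.spark.sql.SparkSession

    object CassandraExport {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cassandra-export-sketch")
          .config("spark.cassandra.connection.host", "127.0.0.1")
          .getOrCreate()

        // Read the whole table in parallel and write it out as Parquet files.
        spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(Map("keyspace" -> "iot", "table" -> "sensor_readings"))
          .load()
          .write
          .mode("overwrite")
          .parquet("/data/export/sensor_readings")

        spark.stop()
      }
    }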
With Spark + Zeppelin, we can visualize Cassandra data: with a little Spark code we can build nice UIs where users can even enter input and see the result as a graph, table, etc.
Note: visualization can actually be better with Kibana/Elasticsearch or Solr/Banana when used with Cassandra, but they are hard to set up and indexing has its own issues to deal with.
There are a lot of other use cases, but personally I have used Spark as a Swiss army knife for multiple tasks.

Apache Cassandra has features like fast reads and writes, so you can use it with Apache Spark Streaming to write your data directly into Cassandra without going through a legacy store.
As a use case, consider a video application that uploads video with the help of streaming and stores it directly into a Cassandra blob.

Related

Kubernetes vs Spark vs Spark on Kubernetes

So I have a use case where I will stream about 1,000 records per minute from Kafka. I just need to dump these records in raw form into a NoSQL DB or something like a data lake.
I tried this with two approaches:
Approach 1
——————————
Create Kafka consumers in Java and run them as three different containers in Kubernetes. Since all the containers are in the same Kafka consumer group, they all contribute towards reading from the same Kafka topic and dumping data into the data lake. This works pretty quickly for the volume of workload I have.
Approach 2
——————————-
I then created a Spark cluster and used the same Java logic to read from Kafka and dump the data into the data lake.
Observations
———————————-
The performance of the Kubernetes approach, if not better, was at least equal to that of a Spark job running in cluster mode.
So my question is: what is the real use case for using Spark over Kubernetes the way I am using it, or even Spark on Kubernetes?
Is Spark only going to rise and shine on much heavier workloads, say on the order of 50,000 records per minute, or in cases where some real-time processing needs to be done on the data before dumping it to the sink?
Spark has more cost associated with it, so I need to make sure I use it only if it would scale better than the Kubernetes solution.
If your case is only to archive/snapshot/dump records, I would recommend looking into Kafka Connect.
If you need to process the records you stream, e.g. aggregate or join streams, then Spark comes into the game. For this case you may also look into Kafka Streams.
Each of these frameworks has its own tradeoffs and performance overheads, but in any case you save a lot of development effort by using tools made for the job rather than developing your own consumers. These frameworks already handle most of the failure handling, scaling, and configurable delivery semantics, and they have enough config options to tune the behaviour for most cases you can imagine. Just choose the available integration and you're good to go! And of course, beware of open source bugs ;).
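As a hedged sketch of the "Spark comes into the game" case, here is a minimal Structured Streaming job (Scala, assuming the spark-sql-kafka package is available; the broker address, topic name, and console sink are placeholders) that aggregates the stream instead of just dumping it:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    object KafkaAggregation {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-streaming-sketch")
          .getOrCreate()

        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()

        // Count records per key over the stream; a plain consumer would
        // have to manage this state and its fault tolerance by hand.
        val counts = raw
          .select(col("key").cast("string"))
          .groupBy("key")
          .count()

        counts.writeStream
          .outputMode("complete")
          .format("console")
          .start()
          .awaitTermination()
      }
    }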
Hope it helps.
Running Spark inside Kubernetes is only recommended when you have a lot of expertise doing it; since Kubernetes doesn't know it's hosting Spark, and Spark doesn't know it's running inside Kubernetes, you will need to double-check every feature you decide to rely on.
For your workload, I'd recommend sticking with Kubernetes. The elasticity, performance, monitoring tools, and scheduling features, plus the huge community support, add up well in the long run.
Spark is an open-source, scalable, massively parallel, in-memory execution engine for analytics applications, so it will really shine when your load becomes more processing-heavy. It simply doesn't have much room to rise and shine if you are only dumping data, so keep it simple.

How to handle backpressure on databases when using Apache Spark?

We are using Apache Spark to perform ETL every 2 hours.
Sometimes Spark puts a lot of pressure on the databases when read/write operations are performed.
For Spark Streaming, I can see a backpressure configuration for Kafka.
Is there a way to handle this issue in batch processing?
Backpressure is actually just a fancy word for setting the maximum receiving rate, so it doesn't work the way you think it does.
What should be done here is actually on the reading end.
In classical JDBC usage, JDBC connectors have a fetchSize property for PreparedStatements. So basically you can consider configuring that fetchSize with regard to what is said in the following answers:
Spark JDBC fetchsize option
What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?
Unfortunately, this might not solve all of your performance issues with your RDBMS.
What you must know is that, compared to the basic JDBC reader which runs on a single worker, partitioning the data using an integer column or a sequence of predicates loads the data in a distributed mode, but introduces a couple of problems. In your case, a high number of concurrent reads can easily throttle the database.
To deal with this, I suggest the following:
If available, consider using specialized data sources over JDBC connections.
Consider using specialized or generic bulk import/export tools like Postgres COPY or Apache Sqoop.
Be sure to understand the performance implications of the different JDBC data source variants, especially when working with a production database.
Consider using a separate replica for the Spark jobs.
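To make the knobs above concrete, here is a minimal sketch of a throttled, partitioned JDBC read (the URL, table, credentials, and column names are placeholders, and the right bounds depend on your data):

    import org.apache.spark.sql.SparkSession

    object ThrottledJdbcRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("jdbc-backpressure-sketch")
          .getOrCreate()

        val orders = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/shop")
          .option("dbtable", "orders")
          .option("user", "etl")
          .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
          .option("fetchsize", "1000")          // rows fetched per round trip
          .option("partitionColumn", "id")      // integer column to split on
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "8")         // caps concurrent connections
          .load()

        orders.write.mode("overwrite").parquet("/data/staging/orders")
        spark.stop()
      }
    }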
If you wish to know more about reading data using the JDBC source, I suggest you read the following:
Spark SQL and Dataset API.
Disclaimer: I'm the co-author of that repo.

3 Cassandra nodes with one being a Spark master - for geospatial or geographic data

I am looking for directions:
I have a Cassandra database with latitude & longitude data. I need to search for data within a radius or a bounding box around a point. I am using the golang (gocql) client to query Cassandra.
I need some understanding regarding Spark and Cassandra, as this seems like the way to go.
Are the following assumptions correct? I have 2 Cassandra nodes (the data has a replication factor of 2).
Should I then add an extra node, install Spark on it, and then connect it to the other two existing Cassandra nodes containing the data (with the Spark Connector from DataStax)?
And do the two existing Cassandra nodes need to have Spark workers installed on them to work with the Spark master node?
When the Spark setup is in place, do you query (in Scala) the existing data, save the result onto the Spark node, and then query that with the golang (gocql) client?
Any directions are welcome.
Thanks in advance
Geospatial searching is a pretty deep topic. If it's just searches you're after (not batch/analytics), I can tell you that you probably don't want to use Spark. Spark isn't very good at 'searching' for data, even when it's geospatial. The main reason is that Spark doesn't index data for efficient searches, and you'd have to create a job/context (unless you're using a job server) every time you want to do a search. That takes forever when you're thinking in terms of user-facing application response times.
Solr, Elasticsearch, and DataStax Enterprise Search (disclaimer: I work for DataStax) are all capable of box and radius searches on Cassandra data, and do so in near real time.
To answer your original question though, if the bulk of your analytics in general come from Cassandra data, it may be a good idea to run Spark on the same nodes as Cassandra for data locality. The nice thing is that Spark scales quite nicely, so if you find Spark taking too many resources from Cassandra, you can simply scale out (both Cassandra and Spark).
Should I then add an extra node, install Spark on it, and then connect it to the other two existing Cassandra nodes containing the data (with the Spark Connector from DataStax)?
Spark is a cluster compute engine so it needs a cluster of nodes to work well. You'll need to install it on all nodes if you want it to be as efficient as possible.
And do the two existing Cassandra nodes need to have Spark workers installed on them to work with the Spark master node?
I don't think they 'have' to have them, but it's a good idea for locality. There's a really good video on academy.datastax.com that shows how the Spark Cassandra connector reads data from Cassandra into Spark. I think it will clear a lot of things up for you: https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data
When the Spark setup is in place, do you query (in Scala) the existing data, save the result onto the Spark node, and then query that with the golang (gocql) client?
The Spark-Cassandra connector can communicate with both Cassandra and Spark. There are methods, saveToCassandra() for example, that write data back to Cassandra after your jobs have processed it. Then you can use your client as you normally would.
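A minimal sketch of that read-compute-write-back loop with the DataStax connector might look like the following (keyspace, tables, and columns are placeholders for your schema):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra

    object GeoRollup {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("connector-sketch")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val sc = new SparkContext(conf)

        // Read rows from Cassandra and compute something per region...
        val counts = sc.cassandraTable("geo", "points")
          .map(row => (row.getString("region"), 1))
          .reduceByKey(_ + _)

        // ...then write the result straight back into another table,
        // which your gocql client can query as usual.
        counts.saveToCassandra("geo", "points_per_region",
          SomeColumns("region", "point_count"))

        sc.stop()
      }
    }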
There are some really good free Spark + Cassandra tutorials at academy.datastax.com. This is also a good place to start: http://rustyrazorblade.com/2015/01/introduction-to-spark-cassandra/

HBase or Cassandra?

In my lambda architecture, I am debating whether to use HDFS or Cassandra to store my immutable data. I need Cassandra to serve the online requests etc., so it is a mandatory part of the tech stack. Now, I do not want to introduce a new tool (HDFS) into the stack if I don't have to. So my question is: what will I be missing if I skip HDFS and use Cassandra to host my immutable data as well?
EDIT:
I understand HDFS is a distributed filesystem and Cassandra is a NoSQL DB. Still, both support data replication and high-throughput writes. In addition, Cassandra supports low-latency data retrieval. So am I right in saying that HDFS isn't going to provide me much lift?
As I understand it, you are trying to decide on the serving layer of your Lambda Architecture.
If that is the case, you want to store your batch views and real-time views in a database.
As I also understand it, you do not have a Hadoop cluster in your batch layer, and your batch views are not produced in HDFS.
At this point your architecture sits outside of HDFS.
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.
If you don't want a Hadoop cluster, omit HBase.
Cassandra is a distributed, column-oriented NoSQL database, and it works without a Hadoop cluster or HDFS.
If I understand your architecture and your needs right, I think Cassandra is best for you.
Additionally, you can get quick info about the Lambda architecture from this link:
http://artofbigdata.blogspot.com.tr/2016/01/lambda-architecture.html
HDFS supports different storage file formats, for example SequenceFiles, Avro, and Parquet, so you can choose a file format suited to your application's needs.
Also note that you can efficiently read the data using SQL-like queries.
So, compared to Cassandra, HDFS gives you more flexibility in the data models available to host the data.
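As a small illustration of that point, here is a sketch (Scala; the HDFS path and column names are placeholders) that reads a Parquet master dataset from HDFS and queries it with SQL:

    import org.apache.spark.sql.SparkSession

    object ImmutableStoreQuery {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("hdfs-formats-sketch").getOrCreate()

        // Immutable master data landed as Parquet on HDFS.
        val events = spark.read.parquet("hdfs:///lambda/master/events")
        events.createOrReplaceTempView("events")

        // SQL-like access for building batch views.
        spark.sql(
          "SELECT event_type, count(*) AS n FROM events GROUP BY event_type"
        ).show()

        spark.stop()
      }
    }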

Data analytics on Cassandra

We are using Apache Cassandra to store our data. Apart from Spark, what tools/technologies can perform data analytics after reading data from Cassandra? Spark is good, but it needs a programmer (Java/Scala/Python) to add or modify things as requirements evolve, which leads to high maintenance cost. What are the other alternatives?
If you don't want to go with Spark on top of Cassandra, many have achieved good results with Cassandra, Hive, and Hadoop. Others have achieved similar results using a mix of Cassandra, Hive, and Solr.
There is another decent set of slides and a tutorial for running analysis of data via Cassandra and Hadoop; you will find a more in-depth explanation via the PDF download on the provided page.
If you're interested in continuing to pursue Spark, you can evaluate DataStax Enterprise, which takes the complexity out of it and lets you run Spark right on top of Cassandra.
To answer your question, you have a few industry-proven options, primarily Hadoop and Hive.
