Improve reading speed of Cassandra in Spark (Parallel reads implementation) - apache-spark

I am new to Spark and trying to combine Cassandra and Spark to do some analytical tasks.
From the Spark web UI I found that most of the time is consumed in the reading process.
When I dug into this particular task, I found that only a single executor is working on it.
Is it possible to improve the performance of this task via some tricks like parallelization?
p.s. I am using the pyspark cassandra connector (https://github.com/TargetHolding/pyspark-cassandra).
UPDATE: I am using a 3-node Spark cluster running Spark 1.6 and a 3-node Cassandra cluster running Cassandra 2.2.4.
And I am selecting data in the form of
"select * from tbl where partitionKey IN (pk_1,pk_2,....,pk_N) and
clusteringKey > ck_1 and clusteringKey < ck_2"
UPDATE2: I've read an article suggesting replacing the IN clause with parallel reads (https://ahappyknockoutmouse.wordpress.com/2014/11/12/246/). How can this be achieved in Spark?

I could answer more precisely if you provided more details about your cluster, the Spark and Cassandra versions, and related configuration. Still, I will try to answer based on my understanding.
Make sure the RDD is actually partitioned (for example, when it is built from a parallelized collection), so that the work can be spread across executors.
If your Spark job is running on only a single executor, please verify your spark-submit command. You can find more details on the spark-submit options for your cluster manager in the Spark documentation.
For speeding up Cassandra read operations, make use of proper indexing. I recommend Solr, which will help with fast data retrieval from Cassandra.
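The parallel-read idea from UPDATE2 of the question can be sketched in plain Python: instead of one query with a large IN list, issue one query per partition key and spread the keys over several workers. This is only a sketch under assumptions: the helper names `split_keys` and `build_queries` are hypothetical, the table and column names are taken from the question, and the `%s` placeholders assume the Python Cassandra driver's parameter style.

```python
from typing import Any, List, Sequence, Tuple


def split_keys(keys: Sequence[str], num_slices: int) -> List[List[str]]:
    """Round-robin the partition keys into at most num_slices groups."""
    slices: List[List[str]] = [[] for _ in range(num_slices)]
    for i, key in enumerate(keys):
        slices[i % num_slices].append(key)
    # Drop empty groups when there are fewer keys than slices.
    return [s for s in slices if s]


def build_queries(keys: Sequence[str], ck_lo: Any, ck_hi: Any) -> List[Tuple[str, tuple]]:
    """Build one single-partition query per key instead of one big IN query."""
    cql = (
        "SELECT * FROM tbl WHERE partitionKey = %s "
        "AND clusteringKey > %s AND clusteringKey < %s"
    )
    return [(cql, (key, ck_lo, ck_hi)) for key in keys]


if __name__ == "__main__":
    keys = ["pk_1", "pk_2", "pk_3", "pk_4", "pk_5"]
    # Each slice would be handled by a different worker (e.g. a Spark task).
    for group in split_keys(keys, 2):
        for cql, params in build_queries(group, 10, 20):
            print(cql, params)
```

In Spark, the slicing step would normally be left to the framework, for example by distributing the key list with `sc.parallelize(keys, num_slices)` and running the per-key queries inside each partition. The DataStax Scala connector also offers `joinWithCassandraTable` for this kind of key-based lookup; whether your Python connector version exposes an equivalent is something to verify in its documentation.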

Related

How Spark can speed up bulk loading to JanusGraph?

I need to load lots of vertices and edges to JanusGraph with Cassandra backend from other storage. I've read about bulk loading and Spark configuring (https://docs.janusgraph.org/advanced-topics/bulk-loading/ and https://docs.janusgraph.org/advanced-topics/hadoop/) .
It's clear how to configure JanusGraph for Spark usage but I'm still not sure how to use Spark then and if Spark can help to speed up inserting into graph.
Please give some use cases and code example of using Hadoop MapReduce or Spark to speed up bulk loading data to Janusgraph (Java or Python are preferred). Any help welcome!
I recently worked on a POC project to bulk load data into JanusGraph using Apache Spark. We were getting pretty good performance loading data with Spark. The setup and sample code are provided in the articles below.
https://medium.com/#nitinpoddar/bulk-loading-data-into-janusgraph-ace7d146af05
https://medium.com/#nitinpoddar/bulk-loading-data-into-janusgraph-part-2-ca946db26582
Alternatively, you can write a Kafka consumer application to load data from your Kafka to JanusGraph. The amount of parallelism will be restricted to the number of partitions of the source/input topic from which your application is reading data. The application will be single-threaded but you can scale your application to the number of input topics. Each instance of your application can open up a connection and write to JanusGraph using a transaction. You can batch transactions with some batch size to spread the load.
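The batching suggested above can be sketched as follows. This is a minimal, library-free illustration: `batched` and `load` are hypothetical helpers, and `write_batch` stands in for whatever code opens a JanusGraph transaction, adds the vertices and edges of one batch, and commits. No Kafka or JanusGraph API is used here.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream of records into lists of at most batch_size items."""
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield batch


def load(records: Iterable[T], batch_size: int,
         write_batch: Callable[[List[T]], None]) -> int:
    """Write records batch by batch; returns the number of batches written."""
    written = 0
    for batch in batched(records, batch_size):
        write_batch(batch)  # hypothetical: one transaction per batch
        written += 1
    return written


if __name__ == "__main__":
    # Stand-in sink that just reports the batch size.
    print(load(range(5), 2, lambda b: print("commit", len(b), "records")))
```

One transaction per batch keeps the number of commits low while bounding how much work is lost if a single commit fails.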

Cassandra(with Hadoop) performance with Spark

We are running Spark/Hadoop on a different set of nodes than Cassandra. We have 10 Cassandra nodes and multiple Spark cores, but Cassandra is not running on Hadoop. Performance in fetching data from Cassandra through Spark (in YARN client mode) is not very good, and bulk data reads from HDFS are faster (6 minutes from Cassandra versus 2 minutes from HDFS). Changing Spark-Cassandra parameters is not helping much either.
Will deploying Hadoop on top of Cassandra solve this issue and significantly improve read performance?
Without looking at your code: bulk reads in an analytics/Spark capacity are always going to be faster when going directly to the file rather than reading from a database. The database offers other advantages such as schema enforcement, availability, distribution control, etc., but I think the performance differences you're seeing are normal.

Spark Kafka Structured Streaming integration with Apache Ignite

Right now there is no way to save Spark DataFrames in Apache Ignite. Support will be included in Apache Ignite 2.2, as mentioned here: https://issues.apache.org/jira/browse/IGNITE-3084. I am using the Structured Streaming API of Apache Spark with Kafka for consuming data. I want to do some aggregations, such as the average value of a particular column or its min/max value, on the consumed data.
My question is whether I should use the Spark SQL DataFrame API to do the above-mentioned aggregations or wait for Apache Ignite 2.2. The documentation mentions that Ignite SQL is 100s of times faster than Spark SQL.
Actually, it's up to you. You could go ahead with Spark now and then, once DataFrame support in Ignite is ready, compare the two approaches and choose whichever fits your needs better.
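For reference, the aggregations in question (average, min and max of one column) are cheap to maintain incrementally over a stream, whichever engine ends up running them. The sketch below is plain Python and not Spark or Ignite code; `RunningStats` is a hypothetical name. In the Spark DataFrame API the same result would typically come from `agg` with the `avg`, `min` and `max` functions of `pyspark.sql.functions`, which is not exercised here.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Incrementally tracks count, sum, min and max of a numeric column."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, value: float) -> None:
        """Fold one consumed value into the running aggregates."""
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def average(self) -> float:
        """Mean of all values seen so far (NaN if nothing was consumed)."""
        return self.total / self.count if self.count else float("nan")


if __name__ == "__main__":
    stats = RunningStats()
    for value in [3.0, 1.0, 5.0]:  # stand-in for values consumed from Kafka
        stats.update(value)
    print(stats.average, stats.minimum, stats.maximum)
```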

Spark goodness with Cassandra?

I've been reading about Apache Cassandra lately to learn how it works and how to use it for IoT projects, especially where a time-series database is needed.
However, I started to notice that Apache Spark is often mentioned when people talk about Cassandra too.
The question is: as long as I can use a Cassandra cluster to serve my app and to store and read data, why would I need Apache Spark? Any useful use cases are appreciated!
The answer is broad, but to summarize: Cassandra is highly scalable and there are a lot of scenarios where it fits, but CQL syntax has some limitations if your schema was not designed for certain queries.
If you want to use your data without those restrictions, run analytical workloads on your Cassandra data, or join it with other tables, Spark is the most appropriate complement. Spark has a tight integration with Cassandra.
I recommend checking these slides: http://www.slideshare.net/patrickmcfadin/apache-cassandra-and-spark-you-got-the-the-lighter-lets-start-the-fire?qid=48e2528c-a03c-49b4-879e-45599b2aff34&v=&b=&from_search=5
Cassandra is for storing data, whereas Spark is for performing computation on top of it. An analogy with Hadoop: Cassandra is like HDFS, whereas Spark is like MapReduce.
Especially with computations, when using the DataStax Cassandra connector, data locality can be exploited. If you need to do some computation which modifies a row (but doesn't really depend on anything else), then that operation is optimized to run locally on each machine in the cluster without any data movement over the network.
The same goes for a lot of other Spark workloads: the actions (functions which modify the data) are done locally and only the result is sent to the client. As far as I know, when you want to do analytics on top of data stored in Cassandra, Spark is a well-supported and popular choice. Even if you don't need to do any operations on the data, you can still use Spark for other purposes, like the ones I mention below.
Spark Streaming can be used to ingest data into or export data from Cassandra (I have used it a lot personally). The same data import/export can be achieved with small hand-written JDBC agents, but the Spark Streaming code I wrote for ingesting 10 GB of data from Cassandra contains less than 20 lines of code, with multi-machine multi-threading built in and an admin UI where I can see the job progress.
With Spark+Zeppelin, we can visualize Cassandra data using Spark: we can build beautiful UIs with little Spark code, where users can even enter input and see the result as a graph, table, etc.
Note: Actually, visualization can be better with Kibana/ElasticSearch or Solr/Banana when used with Cassandra, but they are very hard to set up and indexing has its own issues to deal with.
There are a lot of other use cases, but personally I used Spark as a Swiss army knife for multiple tasks.
Apache Cassandra offers fast reads and writes, so you can use it with Apache Spark Streaming to write your data directly into Cassandra without latency.
As a use case, consider a video application that uploads video via streaming and stores it directly in a Cassandra blob.

3 nodes cassandra with one being a spark master - to solve geospatial data or geographic data

I am looking for directions:
I have a cassandra database with latitude & longitude data. I need to search for data within a radius or a box coordinates around a point. I am using golang(gocql) client to query Cassandra.
I need some understanding regarding Spark and Cassandra, as this seems like the way to go.
Are the following assumptions correct? I have 2 Cassandra nodes (the data has a replication factor of 2).
Should I then install an extra node, install Spark on it, and then connect it to the other two existing Cassandra nodes containing the data (with the Spark Connector from DataStax)?
And do the two existing Cassandra nodes need to have Spark workers installed on them to work with the Spark master node?
When the Spark setup is in place, do you query (in Scala) the existing data, then save the data onto the Spark node, and then query this with the golang (gocql) client?
Any direction is welcome.
Thanks in advance
Geospatial Searching is a pretty deep topic. If it's just doing searches that you're after (not batch/analytics), I can tell you that you probably don't want to use Spark. Spark isn't very good at 'searching' for data - even when it's geospatial. The main reason is that Spark doesn't index data for efficient searches and you'd have to create a job/context (unless using job server) every time you'd want to do a search. That takes forever when you're thinking in terms of user facing application time.
Solr, Elastic Search, and DataStax Enterprise Search (Disclaimer I work for DataStax) are all capable of box and radius searches on Cassandra data and do so in near real time.
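To make "box and radius search" concrete, here is a small plain-Python sketch of the underlying geometry: a bounding box is used as a cheap prefilter, and the haversine great-circle distance decides the final radius match. It is only an illustration of the computation, not of how any of the search engines above implement it; the function names are hypothetical, and it assumes a spherical Earth and a search circle that crosses neither a pole nor the antimeridian.

```python
import math
from typing import Iterable, List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bounding_box(lat: float, lon: float, radius_km: float) -> Tuple[float, float, float, float]:
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing the search circle."""
    d = radius_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(d)
    # Longitude extent of the circle; valid while the circle excludes the poles.
    dlon = math.degrees(math.asin(math.sin(d) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def within_radius(points: Iterable[Tuple[float, float]], lat: float, lon: float,
                  radius_km: float) -> List[Tuple[float, float]]:
    """Box prefilter first, exact haversine check second."""
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, radius_km)
    hits = []
    for p_lat, p_lon in points:
        if not (min_lat <= p_lat <= max_lat and min_lon <= p_lon <= max_lon):
            continue  # outside the box, cannot be inside the circle
        if haversine_km(lat, lon, p_lat, p_lon) <= radius_km:
            hits.append((p_lat, p_lon))
    return hits


if __name__ == "__main__":
    candidates = [(0.0, 0.5), (0.8, 0.8), (0.0, 2.0)]
    print(within_radius(candidates, 0.0, 0.0, 100.0))
```

Without an index, this check has to run over every stored point, which is the scan a search engine avoids.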
To answer your original question though, if the bulk of your analytics in general comes from Cassandra data, it may be a good idea to run Spark on the same nodes as Cassandra for data locality. The nice thing is that Spark scales quite nicely, so if you find Spark taking too many resources from Cassandra, you can simply scale out (both Cassandra and Spark).
Should I then install an extra node and install Spark on this and then
connect it to the other two existing Cassandra nodes containing the
data(With the Spark Connector from DataStax).
Spark is a cluster compute engine so it needs a cluster of nodes to work well. You'll need to install it on all nodes if you want it to be as efficient as possible.
And do the two existing Cassandra nodes need to have Spark workers
installed on them to work with Spark Master node?
I don't think they 'have' to have them, but it's a good idea for locality. There's a really good video on academy.datastax.com that shows how the spark cassandra connector reads data from Cassandra to Spark. I think it will clear a lot of things up for you: https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data
When the Spark setup is in place, do you query(Scala) the existing
data and then save the data onto the Spark node and then query this
with the golang(gocql) client?
The Spark-Cassandra connector can communicate with both Cassandra and Spark. There are methods, saveToCassandra() for example, that will write data back to Cassandra once your jobs are processed. Then you can use your client as you normally would.
There are some really good free Spark + Cassandra tutorials at academy.datastax.com. This is also a good place to start: http://rustyrazorblade.com/2015/01/introduction-to-spark-cassandra/

Resources