Is it possible to run Shark queries over Spark Streaming data? - apache-spark

Is it possible to run Shark queries over the data contained in the DStreams of a Spark Streaming application (for instance, inside a foreachRDD call)?
Is there any specific API to do that?
Thanks.

To answer my own question, in case someone runs into the same problem:
the direct answer to my question is NO, you cannot run Shark directly on Spark Streaming data.
Spark SQL is currently a valid alternative, at least it was for my needs.
It is included in Spark and doesn't require any extra configuration; you can have a look at it here: http://spark.apache.org/docs/latest/sql-programming-guide.html
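For reference, here is a minimal sketch of that pattern in Scala, assuming Spark 1.x-era APIs and a hypothetical DStream[Record] named recordStream (the case class and stream are placeholders, not part of the original question):

    import org.apache.spark.sql.SQLContext

    case class Record(id: Long, value: String)

    recordStream.foreachRDD { rdd =>
      // Reuse (or lazily create) a SQLContext for this micro-batch
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._

      // Expose the current micro-batch as a temporary table
      rdd.toDF().registerTempTable("records")

      // Run a SQL query over this micro-batch only
      sqlContext.sql("SELECT value, COUNT(*) AS cnt FROM records GROUP BY value").show()
    }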

Related

What exactly is the need for Spark when using Talend?

I am new to both Spark and Talend, but I read everywhere that both of these are ETL tools. I also read another Stack Overflow answer here. What I understood from that answer is that Talend does use Spark for large data processing. But can Talend do all the ETL work that Spark does efficiently, without using Spark under the hood? Or is it essentially a wrapper over Spark, where all the data sent to Talend is actually processed by the Spark embedded inside it?
I am quite confused by this. Can someone clarify?
Unlike Informatica BDM, which has its own Blaze framework for native processing on Hadoop, Talend relies on other frameworks such as MapReduce (on Hadoop, possibly with Tez underneath) or the Spark engine. So you could avoid Spark, but there is little point in doing so. The key point is that you can expect some productivity gain from Talend because it is graphical, which is handy when there are many fields and you do not necessarily need the most skilled staff.
For NoSQL stores like HBase, Talend provides specific connectors, or you can go the Phoenix route. Talend also has connectors for Kafka.
Spark is just one of the frameworks supported by Talend. When you create a new job, you can pick Spark from the dropdown list. You can get more details in the docs.

Does filtering an IgniteRDD happen locally in the Spark application or on the Ignite server?

If I execute a filter on an IgniteRDD, is the filter pushed down to the Ignite server, or does the Spark RDD first collect all the data and then execute the filter within the Spark application?
There is no collect at all, but as far as I know there is a distinction between two cases:
A plain filter will use standard Spark execution.
sql will be processed by Ignite itself, without involving Spark.
It all depends on the Catalyst optimizer. You can check the query plans to understand your pipeline and see where it is executed. Debugging might also help.
As explained here, IgniteRDD is an implementation of the Spark RDD API that represents an Ignite cache and can be used through the Spark API. As the example there shows, filter operates on the cache directly.
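A minimal sketch of the two cases, assuming the ignite-spark module, a cache named "persons", and a hypothetical Person class (depending on the Ignite version, the type parameters may go on IgniteContext rather than fromCache):

    import org.apache.ignite.spark.IgniteContext

    case class Person(name: String, age: Int)

    val igniteContext = new IgniteContext(sc, "ignite-config.xml")
    val persons = igniteContext.fromCache[Long, Person]("persons")

    // Plain RDD filter: evaluated by Spark tasks as they scan the cache partitions
    val adults = persons.filter { case (_, p) => p.age >= 18 }

    // sql: pushed down to Ignite and executed on the Ignite nodes, returning a DataFrame
    val adultsDF = persons.sql("select * from Person where age >= ?", 18)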

Spark goodness with Cassandra?

I've been reading about Apache Cassandra lately to learn how it works and how to use it for IoT projects, especially where a time-series-based database is needed.
However, I started to notice that Apache Spark is often mentioned when people talk about Cassandra too.
The question is: as long as I can use a Cassandra cluster of nodes to serve my app, to store and read data, why would I need Apache Spark? Any useful use cases are appreciated!
The answer is broad, but to summarize: Cassandra is highly scalable and there are a lot of scenarios where it fits, but CQL syntax has some limitations if your schema isn't designed for certain queries.
If you want to make use of your data without those restrictions, run analytical workloads over your Cassandra data, or join it with other tables, Spark is the most appropriate complement. Spark has tight integration with Cassandra.
I recommend you check these slides: http://www.slideshare.net/patrickmcfadin/apache-cassandra-and-spark-you-got-the-the-lighter-lets-start-the-fire?qid=48e2528c-a03c-49b4-879e-45599b2aff34&v=&b=&from_search=5
Cassandra is for storing data, whereas Spark is for performing computation on top of it. An analogy with Hadoop: Cassandra is like HDFS, whereas Spark is like MapReduce.
For computations in particular, data locality can be exploited when using the DataStax Cassandra connector. If you need to do some computation that modifies a row (but doesn't really depend on anything else), that operation is optimized to run locally on each machine in the cluster without any data movement over the network.
The same goes for a lot of other Spark workloads: the operations (functions that transform the data) are done locally and only the result is sent to the client. As far as I know, when you want to do analytics on top of data stored in Cassandra, Spark is well supported and a popular choice (see the sketch below). If you don't need to do any operations on the data, you can still use Spark for other purposes, as I mention below.
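A minimal sketch of that kind of analytics in Scala, assuming the DataStax spark-cassandra-connector and a hypothetical keyspace "iot" with a table "readings":

    import com.datastax.spark.connector._

    // Read the table as an RDD of CassandraRow; tasks are scheduled next to the replicas when possible
    val readings = sc.cassandraTable("iot", "readings")

    // Average reading per sensor, computed in parallel across the cluster
    val avgBySensor = readings
      .map(row => (row.getString("sensor_id"), (row.getDouble("value"), 1L)))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, n) => sum / n }

    avgBySensor.take(10).foreach(println)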
Spark Streaming can be used to ingest data into, or export data from, Cassandra (I personally used it a lot). The same import/export can be achieved with small hand-written JDBC agents, but the Spark Streaming code I wrote for ingesting 10 GB of data from Cassandra is less than 20 lines, with multi-machine, multi-threaded execution built in and an admin UI where I can see the job progress.
With Spark + Zeppelin, we can visualize Cassandra data: with little Spark code we can build beautiful UIs where users can even enter input and see the result as a graph, table, etc.
Note: visualization can actually be better with Kibana/Elasticsearch or Solr/Banana when used with Cassandra, but they are quite hard to set up and indexing has its own issues to deal with.
There are a lot of other use cases, but personally I used Spark as a Swiss army knife for multiple tasks.
Apache Cassandra has features like fast reads and writes, so you can use it with Spark Streaming to write your data directly into Cassandra.
As a use case, consider a video application that uploads video via streaming and stores it directly into a Cassandra blob.

Improve reading speed of Cassandra in Spark (Parallel reads implementation)

I am new to Spark and trying to combine Cassandra and Spark to do some analytical tasks.
From the Spark web UI I found that most of the time is consumed in the reading process.
When I dug into this particular task, I found that only a single executor is working on it.
Is it possible to improve the performance of this task via some tricks like parallelization?
p.s. I am using the pyspark cassandra connector (https://github.com/TargetHolding/pyspark-cassandra).
UPDATE: I am using a 3-node Spark cluster running Spark 1.6 and a 3-node Cassandra cluster running Cassandra 2.2.4.
And I am selecting data in the form of:
"select * from tbl where partitionKey IN (pk_1, pk_2, ..., pk_N) and clusteringKey > ck_1 and clusteringKey < ck_2"
UPDATE 2: I've read an article suggesting replacing the IN clause with parallel reads (https://ahappyknockoutmouse.wordpress.com/2014/11/12/246/). How can this be achieved in Spark?
I will be able to answer more precisely if you provide more details about your cluster, Spark and Cassandra versions, and related setup. Still, I will try to answer as per my understanding.
Make sure your RDDs (parallelized collections) are properly partitioned.
If your Spark job is running on only a single executor, verify your spark-submit command; you can get more details about spark-submit options here, depending on your cluster manager.
To speed up Cassandra read operations, make use of proper indexing. I would recommend using Solr, which will help with fast data retrieval from Cassandra.
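As for the parallel-read approach mentioned in UPDATE 2: a minimal sketch of the idea, assuming the Scala DataStax spark-cassandra-connector (the pyspark-cassandra API differs) and the question's hypothetical keyspace, table, and column names with placeholder bounds:

    import com.datastax.spark.connector._

    // One element per partition key that would have gone into the IN clause
    val keys = sc.parallelize(Seq(Tuple1("pk_1"), Tuple1("pk_2"), Tuple1("pk_3")))

    // Each Spark task fetches the rows for its keys directly from Cassandra, so the
    // reads run in parallel instead of as one big IN query handled by a single executor
    val rows = keys
      .joinWithCassandraTable("my_keyspace", "tbl")
      .on(SomeColumns("partitionKey"))
      .where("clusteringKey > ? and clusteringKey < ?", 0, 100)   // placeholder bounds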

How to share data from Spark RDD between two applications

What is the best way to share Spark RDD data between two Spark jobs?
I have a case where job 1, a Spark sliding-window streaming app, consumes data at regular intervals and creates RDDs. We do not want to persist these to storage.
Job 2 is a query job that will access the same RDDs created in job 1 and generate reports.
I have seen a few posts suggesting Spark Job Server, but as it is open source I am not sure whether it is a viable solution; any pointers will be of great help.
Thank you!
The short answer is that you can't share RDDs between jobs. The only way to share data is to write it to HDFS and then pull it in within the other job. If speed is an issue and you want to maintain a constant stream of data, you can use HBase, which allows very fast access and processing from the second job.
To get a better idea you should look here:
Serializing RDD
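A minimal sketch of the HDFS hand-off (the path and record type are hypothetical):

    // Job 1: persist the windowed RDD to a shared location
    windowedRdd.saveAsObjectFile("hdfs:///shared/latest-window")

    // Job 2: read it back and build the report
    val shared = sc.objectFile[MyRecord]("hdfs:///shared/latest-window")
    println(shared.count())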
You can share RDDs across different applications using Apache Ignite.
Apache Ignite provides an abstraction for shared RDDs through which one application can access RDDs created by another. In addition, Ignite has support for SQL indexes, whereas native Spark doesn't.
Please refer to https://ignite.apache.org/features/igniterdd.html for more details.
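A minimal sketch of the shared-RDD approach (the cache name, config file, and RDD types are hypothetical, and depending on the Ignite version the type parameters may go on IgniteContext rather than fromCache):

    import org.apache.ignite.spark.IgniteContext

    // Application 1 (the streaming job): publish each window into a named Ignite cache
    val ic = new IgniteContext(sc, "ignite-config.xml")
    val shared = ic.fromCache[Long, String]("windowCache")
    shared.savePairs(windowRdd)   // windowRdd: RDD[(Long, String)] produced by the stream

    // Application 2 (the reporting job): attach to the same cache by name and query it
    val ic2 = new IgniteContext(sc2, "ignite-config.xml")
    val fromStream = ic2.fromCache[Long, String]("windowCache")
    println(fromStream.count())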
According to what the official documentation describes:
Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.
http://spark.apache.org/docs/latest/job-scheduling.html
You can save to a global temporary view. The view will be available to other sessions in the same Spark application until the application that created it stops; a small sketch is shown below.
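Note that this shares data across sessions within a single Spark application, which matches the documentation's suggestion of running one server application that answers queries. A minimal sketch, with a hypothetical DataFrame df:

    // Session A: register the DataFrame as a global temporary view
    df.createGlobalTempView("latest_window")

    // Session B: another SparkSession in the same application can query it
    val other = spark.newSession()
    other.sql("SELECT COUNT(*) FROM global_temp.latest_window").show()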
