How to enable streaming from Cassandra to Spark? - apache-spark

I have the following spark job:
from __future__ import print_function
import os
import sys
import time
from random import random
from operator import add
from pyspark.streaming import StreamingContext
from pyspark import SparkContext,SparkConf
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark_cassandra import streaming,CassandraSparkContext
if __name__ == "__main__":
    conf = SparkConf().setAppName("PySpark Cassandra Test")
    sc = CassandraSparkContext(conf=conf)
    stream = StreamingContext(sc, 2)
    rdd = sc.cassandraTable("keyspace2", "users").collect()
    #print rdd
    stream.start()
    stream.awaitTermination()
    sc.stop()
When I run this, it gives me the following error:
ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: \
No output operations registered, so nothing to execute
the shell script I run:
./bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.2.4 \
    examples/src/main/python/test/reading-cassandra.py
Comparing this with Spark Streaming + Kafka, the above code is missing a line like:
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", {'topic':1})
where I'm actually using createStream; but for Cassandra, I can't see anything like this in the docs. How do I start streaming between Spark Streaming and Cassandra?
Versions:
Cassandra v2.1.12
Spark v1.4.1
Scala 2.10

To create a DStream out of a Cassandra table, you can use a ConstantInputDStream, providing the RDD created from the Cassandra table as input. This will result in the RDD being materialized on each DStream batch interval.
Be warned that large tables, or tables that continuously grow in size, will negatively impact the performance of your streaming job.
See also: Reading from Cassandra using Spark Streaming for an example.
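Note that ConstantInputDStream lives in the Scala/Java streaming API and is not exposed in pyspark.streaming. As a rough, untested PySpark-side sketch of the same idea (keyspace and table names taken from the question), one can drive each batch with a queueStream whose default RDD fires every interval and re-read the table inside foreachRDD:
from __future__ import print_function
from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark_cassandra import CassandraSparkContext

conf = SparkConf().setAppName("PySpark Cassandra Test")
sc = CassandraSparkContext(conf=conf)
ssc = StreamingContext(sc, 2)

# With an empty queue, queueStream returns the default RDD on every batch
# interval -- a driver-side "tick" we can piggyback on.
ticks = ssc.queueStream([], default=sc.parallelize([0]))

def reread_table(time, _):
    # Runs on the driver once per interval; re-reads the Cassandra table.
    rows = sc.cassandraTable("keyspace2", "users").collect()
    print(time, rows)

# foreachRDD registers an output operation, which is what the original job
# was missing ("No output operations registered").
ticks.foreachRDD(reread_table)

ssc.start()
ssc.awaitTermination()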

Related

Error while instantiating 'org.apache.spark.sql.hive.HiveACLSessionStateBuilder'

The problem is that the pyspark script runs fine in one cluster, but the error occurs when I run the same pyspark script in another YARN cluster. I guess the Spark environment configurations are different between the two clusters.
Here is the code to initialize the spark session.
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
from pyspark.sql.session import SparkSession
sc = SparkContext()
spark = SparkSession(sc)
hive_context = HiveContext(sc)
The error:
Error while instantiating 'org.apache.spark.sql.hive.HiveACLSessionStateBuilder'

Is it possible to limit resources assigned to a Spark session?

I'm launching pySpark sessions with the following code:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import *
spark = SparkSession.builder.getOrCreate()
I've noticed that if a notebook is running a pySpark query, and a second notebook tries to start a Spark session, the second Spark session will not start until the first one has finished (i.e. the first session is taking all the resources).
Is there some way to limit the resources of a Spark session or parallelize multiple sessions somehow?
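For what it's worth, resource caps are normally expressed as Spark configuration set before the session starts rather than on the session object itself. A sketch of the usual knobs (which keys actually apply depends on your cluster manager, and the values here are placeholders):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.memory", "2g")     # memory per executor
         .config("spark.executor.cores", "2")       # cores per executor
         .config("spark.cores.max", "4")            # total-core cap (standalone/Mesos)
         .config("spark.dynamicAllocation.enabled", "true")    # YARN: scale executors up/down
         .config("spark.dynamicAllocation.maxExecutors", "2")  # requires the external shuffle service
         .getOrCreate())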

Error when creating sqlContext in Apache Spark

I am using Apache Spark and running it in an IPython notebook.
I am trying to convert a regular dataframe to a Spark DataFrame. For that I need sqlContext. When I use it, I get an error.
Error says:
IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
I have looked up multiple resources but am not able to solve this issue.
SQLContext used to be the entry point for the SQL functionality in Spark 1.x; in Spark 2 it has been replaced with SparkSession (documentation). So, here is the proper way to initialize Spark in version 2.2, which is the one you are using according to your screenshot:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
sc = SparkContext(conf=conf)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
See the documentation on Spark SQL for further usage examples.
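As a follow-up for the original goal (turning a regular, i.e. pandas, DataFrame into a Spark DataFrame), a minimal sketch using the session created above; the column names and values here are made up:
import pandas as pd

pdf = pd.DataFrame({"name": ["alice", "bob"], "age": [30, 25]})  # hypothetical pandas DataFrame
sdf = spark.createDataFrame(pdf)   # pandas -> Spark DataFrame
sdf.show()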

Specifying custom profilers for pyspark running Spark 2.0

I would like to know how to specify a custom profiler class in PySpark for Spark version 2+. Under 1.6, I know I can do so like this:
sc = SparkContext('local', 'test', profiler_cls='MyProfiler')
but when I create the SparkSession in 2.0 I don't explicitly have access to
the SparkContext. Can someone please advise how to do this for Spark 2.0+?
SparkSession can be initialized with an existing SparkContext, for example:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.profiler import BasicProfiler
spark = SparkSession(SparkContext('local', 'test', profiler_cls=BasicProfiler))
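And a sketch of how a custom profiler could plug into that (MyProfiler is hypothetical; note that Python profiling also has to be switched on via spark.python.profile):
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.profiler import BasicProfiler

class MyProfiler(BasicProfiler):
    # Hypothetical custom profiler: override show()/stats() as needed.
    def show(self, id):
        print("custom profile for RDD %s" % id)
        BasicProfiler.show(self, id)

conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext('local', 'test', conf=conf, profiler_cls=MyProfiler)
spark = SparkSession(sc)

spark.range(100).rdd.map(lambda x: x).count()  # something for the profiler to record
sc.show_profiles()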

How to connect Spark Streaming with Cassandra?

I'm using
Cassandra v2.1.12
Spark v1.4.1
Scala 2.10
and cassandra is listening on
rpc_address:127.0.1.1
rpc_port:9160
For example, to connect Kafka and Spark Streaming, listening to Kafka every 4 seconds, I have the following Spark job:
sc = SparkContext(conf=conf)
stream=StreamingContext(sc,4)
map1={'topic_name':1}
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", map1)
And Spark Streaming keeps listening to the Kafka broker every 4 seconds and outputs the contents.
In the same way, I want Spark Streaming to listen to Cassandra and output the contents of the specified table every, say, 4 seconds.
How to convert the above streaming code to make it work with cassandra instead of kafka?
The non-streaming solution
I can obviously keep running the query in an infinite loop, but that's not true streaming, right?
spark job:
from __future__ import print_function
import time
import sys
from random import random
from operator import add
from pyspark.streaming import StreamingContext
from pyspark import SparkContext,SparkConf
from pyspark.sql import SQLContext
from pyspark.streaming import *
sc = SparkContext(appName="sparkcassandra")
while(True):
    time.sleep(5)
    sqlContext = SQLContext(sc)
    stream = StreamingContext(sc, 4)
    lines = stream.socketTextStream("127.0.1.1", 9160)
    sqlContext.read.format("org.apache.spark.sql.cassandra")\
        .options(table="users", keyspace="keyspace2")\
        .load()\
        .show()
I run it like this:
sudo ./bin/spark-submit --packages \
datastax:spark-cassandra-connector:1.4.1-s_2.10 \
examples/src/main/python/sparkstreaming-cassandra2.py
and I get the table values, which roughly look like:
lastname|age|city|email|firstname
So what's the correct way of "streaming" the data from cassandra?
Currently the "Right Way" to stream data from C* is not to Stream Data from C* :) Instead it usually makes much more sense to have your message queue (like Kafka) in front of C* and Stream off of that. C* doesn't easily support incremental table reads although this can be done if the clustering key is based on insert time.
If you are interested in using C* as a streaming source, be sure to check out and comment on
https://issues.apache.org/jira/browse/CASSANDRA-8844
(Change Data Capture), which is most likely what you are looking for.
If you are actually just trying to read the full table periodically and do something with it, you may be best off with just a cron job launching a batch operation, as you really have no way of recovering state anyway.
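A sketch of such a batch job, reusing the DataFrame reader already shown in the question (scheduling is then just a cron entry that calls spark-submit on this file):
# batch_read_users.py -- plain batch job: no StreamingContext, no loop
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="cassandra-periodic-read")
sqlContext = SQLContext(sc)

(sqlContext.read.format("org.apache.spark.sql.cassandra")
    .options(table="users", keyspace="keyspace2")
    .load()
    .show())

sc.stop()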
Currently, Cassandra is not natively supported as a streaming source in Spark 1.6; you must implement a custom receiver for your own case (listen to Cassandra and output the contents of the specified table every, say, 4 seconds).
Please refer to the implementation guide:
Spark Streaming Custom Receivers
