How to connect Spark Streaming with Cassandra?

I'm using:
Cassandra v2.1.12
Spark v1.4.1
Scala 2.10
and Cassandra is listening on:
rpc_address: 127.0.1.1
rpc_port: 9160
For example, to connect Kafka and Spark Streaming, polling Kafka every 4 seconds, I have the following Spark job:
sc = SparkContext(conf=conf)
stream = StreamingContext(sc, 4)
map1 = {'topic_name': 1}
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", map1)
Spark Streaming then keeps polling the Kafka broker every 4 seconds and outputs the contents.
In the same way, I want Spark Streaming to listen to Cassandra and output the contents of the specified table every, say, 4 seconds.
How do I convert the above streaming code to work with Cassandra instead of Kafka?
The non-streaming solution
I can obviously keep running the query in an infinite loop, but that's not true streaming, right?
Spark job:
from __future__ import print_function
import time
import sys
from random import random
from operator import add
from pyspark.streaming import StreamingContext
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.streaming import *

sc = SparkContext(appName="sparkcassandra")
while True:
    time.sleep(5)
    sqlContext = SQLContext(sc)
    stream = StreamingContext(sc, 4)
    lines = stream.socketTextStream("127.0.1.1", 9160)
    sqlContext.read.format("org.apache.spark.sql.cassandra")\
        .options(table="users", keyspace="keyspace2")\
        .load()\
        .show()
Run it like this:
sudo ./bin/spark-submit --packages \
datastax:spark-cassandra-connector:1.4.1-s_2.10 \
examples/src/main/python/sparkstreaming-cassandra2.py
and I get the table values, which roughly look like:
lastname|age|city|email|firstname
So what's the correct way of "streaming" the data from Cassandra?

Currently the "Right Way" to stream data from C* is not to Stream Data from C* :) Instead it usually makes much more sense to have your message queue (like Kafka) in front of C* and Stream off of that. C* doesn't easily support incremental table reads although this can be done if the clustering key is based on insert time.
If you are interested in using C* as a streaming source, be sure to check out and comment on CASSANDRA-8844 (Change Data Capture), which is most likely what you are looking for:
https://issues.apache.org/jira/browse/CASSANDRA-8844
If you are actually just trying to read the full table periodically and do something with it, you are probably best off with a cron job launching a batch operation, as you really have no way of recovering state anyway.
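For illustration, a minimal Scala sketch of such a periodic batch job, scheduled externally with cron + spark-submit. The table name, the insert_time column, and the watermark handling are assumptions for the example, not from the question:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col
import org.apache.spark.{SparkConf, SparkContext}

object PeriodicCassandraRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("periodic-cassandra-read"))
    val sqlContext = new SQLContext(sc)

    // The "last run" watermark would normally be persisted somewhere (a file, a table, ...).
    val lastRun = java.sql.Timestamp.valueOf("2016-01-01 00:00:00")

    // Incremental reads like this work best when insert_time is a clustering column, as noted above.
    val newRows = sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "events", "keyspace" -> "keyspace2"))
      .load()
      .filter(col("insert_time") > lastRun)

    newRows.show()
    sc.stop()
  }
}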

Currently Cassandra is not natively supported as a streaming source in Spark 1.6; you must implement a custom receiver for your own case (listen to Cassandra and output the contents of the specified table every, say, 4 seconds).
Please refer to the implementation guide:
Spark Streaming Custom Receivers
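For reference, below is a rough, untested Scala sketch of what such a receiver could look like: it polls a Cassandra table on a fixed interval through the connector's CassandraConnector and pushes the rows into a DStream. The keyspace, table, and 4-second interval come from the question; everything else (class name, error handling omitted) is an assumption. Note that custom receivers are a JVM (Scala/Java) API.

import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import scala.collection.JavaConverters._

class CassandraPollingReceiver(conf: SparkConf, keyspace: String, table: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Run the polling loop on its own thread so onStart() returns immediately.
    new Thread("cassandra-poller") {
      override def run(): Unit = poll()
    }.start()
  }

  override def onStop(): Unit = ()  // the polling loop checks isStopped()

  private def poll(): Unit = {
    val connector = CassandraConnector(conf)
    while (!isStopped()) {
      connector.withSessionDo { session =>
        // Naive full-table read; a real receiver would track what it has already emitted.
        session.execute(s"SELECT * FROM $keyspace.$table").all().asScala
          .foreach(row => store(row.toString))
      }
      Thread.sleep(4000)
    }
  }
}

// Usage: ssc.receiverStream(new CassandraPollingReceiver(sc.getConf, "keyspace2", "users"))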

Related

How to restart pyspark streaming query from checkpoint data?

I am creating a Spark Structured Streaming application using pyspark 2.2.0, and I am able to create a streaming query:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StreamingApp") \
    .getOrCreate()

staticDataFrame = spark.read.format("parquet")\
    .option("inferSchema", "true")\
    .load("processed/Nov18/")

staticSchema = staticDataFrame.schema

streamingDataFrame = spark.readStream\
    .schema(staticSchema)\
    .option("maxFilesPerTrigger", 1)\
    .format("parquet")\
    .load("processed/Nov18/")

daily_trs = streamingDataFrame.select("shift", "date", "time")\
    .groupBy("date", "shift")\
    .count()

writer = daily_trs.writeStream\
    .format("parquet")\
    .option("path", "data")\
    .option("checkpointLocation", "data/checkpoints")\
    .queryName("streamingData")\
    .outputMode("append")

query = writer.start()
query.awaitTermination()
The query is streaming, and any file added to "processed/Nov18" will be processed and stored in "data/".
If the streaming query fails, I want to restart the same query.
Path to solution
According to the official documentation, I can get an id that can be used to restart the query:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html?highlight=streamingquery#pyspark.sql.streaming.StreamingQuery.id
The pyspark.streaming module contains the StreamingContext class, which has the classmethod
getActiveOrCreate(checkpointPath, setupFunc)
https://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext.getOrCreate
Can these methods be used somehow?
Does anyone have a use case of a production-ready streaming app for reference?
You should simply (re)start the pyspark application with the checkpoint directory available, and Spark Structured Streaming does the rest. No changes are required.
Does anyone have a use case of a production-ready streaming app for reference?
I'd ask on the Spark users mailing list.

Restart streaming query without stopping application

I'm trying to restart a streaming query in Spark using the code below in place of query.awaitTermination(). The code will live inside an infinite loop that waits for a trigger to restart the query and then executes it. Basically, I'm trying to refresh a cached DataFrame.
query.processAllAvailable()
query.stop()
// oldDF is a cached DataFrame created from a GlobalTempView, about 150 GB in size.
oldDF.unpersist()
val inputDf: DataFrame = readFile(spec, sparkSession) // read the file from S3 or any other source
val recreateddf = inputDf.persist()
// Start the query: should I start the query again here by invoking readStream?
But when I looked into the Spark documentation, it says:
void processAllAvailable(): Blocks until all available data in the source has been processed and committed to the sink. This method is intended for testing. Note that in the case of continually arriving data, this method may block forever. Additionally, this method is only guaranteed to block until data that has been synchronously appended to a Source prior to invocation (i.e. getOffset must immediately reflect the addition).
void stop(): Stops the execution of this query if it is running. This method blocks until the threads performing execution have stopped.
So what's a better way to restart the query without stopping my Spark streaming application?
This has worked for me.
Below is the scenario I followed in Spark 2.4.5 for left outer joins and left joins. The process below pushes Spark to read the latest dimension data changes.
The process is for a stream join with a batch dimension that is always updated.
Step 1: Before starting the Spark streaming job, make sure the dimension batch data folder contains only one file, and that the file has at least one record (for some reason placing an empty file does not work).
Step 2: Start your streaming job and add a stream record to the Kafka stream.
Step 3: Overwrite the dimension data with new values. The file must keep the same name and the dimension folder should still contain only that one file. Note: don't use Spark to write to this folder; use Java or Scala file I/O to overwrite the file, or use bash to delete the file and replace it with a new data file of the same name (a small sketch follows after these steps).
Step 4: In the next batch, Spark is able to read the updated dimension data while joining with the Kafka stream.
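As an illustration of Step 3, here is a small Scala sketch that replaces the dimension file in place using plain Java NIO; the path, file name, and row values are made up for the example.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}

// Overwrite the single dimension file, keeping the same name, without using Spark writers.
val dimFile = Paths.get("src/main/resources/broadcasttest/dimension/dim.csv")
val newContent =
  """id,countrycode,countryname,timestamp_column_fin_2
    |1,IN,India,2020-05-01 00:00:00
    |""".stripMargin

Files.write(dimFile, newContent.getBytes(StandardCharsets.UTF_8),
  StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)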
Sample code:
package com.databroccoli.streaming.streamjoinupdate

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
import org.apache.spark.sql.{DataFrame, SparkSession}

object BroadCastStreamJoin3 {

  def main(args: Array[String]): Unit = {

    @transient lazy val logger: Logger = Logger.getLogger(getClass.getName)

    Logger.getLogger("akka").setLevel(Level.WARN)
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("com.amazonaws").setLevel(Level.ERROR)
    Logger.getLogger("com.amazon.ws").setLevel(Level.ERROR)
    Logger.getLogger("io.netty").setLevel(Level.ERROR)

    val spark = SparkSession
      .builder()
      .master("local")
      .getOrCreate()

    val schemaUntyped1 = StructType(
      Array(
        StructField("id", StringType),
        StructField("customrid", StringType),
        StructField("customername", StringType),
        StructField("countrycode", StringType),
        StructField("timestamp_column_fin_1", TimestampType)
      ))

    val schemaUntyped2 = StructType(
      Array(
        StructField("id", StringType),
        StructField("countrycode", StringType),
        StructField("countryname", StringType),
        StructField("timestamp_column_fin_2", TimestampType)
      ))

    // Streaming fact data read from CSV files.
    val factDf1 = spark.readStream
      .schema(schemaUntyped1)
      .option("header", "true")
      .csv("src/main/resources/broadcasttest/fact")

    // Batch dimension data; keeping the same file name lets each micro-batch
    // pick up the overwritten contents (see the steps above).
    val dimDf3 = spark.read
      .schema(schemaUntyped2)
      .option("header", "true")
      .csv("src/main/resources/broadcasttest/dimension")
      .withColumnRenamed("id", "id_2")
      .withColumnRenamed("countrycode", "countrycode_2")

    import spark.implicits._

    factDf1
      .join(
        dimDf3,
        $"countrycode_2" <=> $"countrycode",
        "inner"
      )
      .writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
Your question is a little unclear (the second piece of code doesn't use the df you want to persist, so I am not sure how you intend to integrate them... I assume a join?).
We had a similar issue (using Spark 2.1) and solved it by creating a custom implementation of Sink (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/Sink.scala) where the data is loaded in addBatch. Since your settings indicate that you are only processing one file at a time and there is no watermarking, you can probably cram your logic into the addBatch method... though this is kind of hacky (our use case was slightly different, I believe).
If Spark 2.2 is an option, then you are in luck. Spark 2.2 adds the "run once" trigger, which lets you use the Structured Streaming API for batch jobs (which is essentially what you are trying to do). If you modify your write stream to use this new trigger, then the infinite loop might work (though I have never tried). You might be better off using an external scheduler to run the streaming job in batch mode. You can read more about the Run Once trigger here: https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html
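For illustration, a minimal Scala sketch of that trigger; streamingDf, the paths, and the options are placeholders, not from the question:

import org.apache.spark.sql.streaming.Trigger

// streamingDf stands in for whatever readStream DataFrame the job builds.
val query = streamingDf.writeStream
  .format("parquet")
  .option("path", "data")
  .option("checkpointLocation", "data/checkpoints")
  .trigger(Trigger.Once())   // process whatever is available once, then stop
  .outputMode("append")
  .start()

query.awaitTermination()     // returns once the single run completes

With the checkpoint location set, each scheduled run picks up where the previous one left off.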
If you are using EMR, then Spark 2.2 isn't available yet... but I have heard it will be out in the next couple weeks (fingers crossed).
You can find some complete Sink implementation examples here: https://github.com/holdenk/spark-structured-streaming-ml/blob/master/src/main/scala/com/high-performance-spark-examples/structuredstreaming/CustomSink.scala

How to work with PySpark, SparkSQL and Cassandra?

I am a bit confused with the different actors in this story: PySpark, SparkSQL, Cassandra and the pyspark-cassandra connector.
As I understand it, Spark has evolved quite a bit, and SparkSQL is now a key component (with its 'dataframes'). Apparently there is absolutely no reason to work without SparkSQL, especially when connecting to Cassandra.
So my question is: what components are needed, and how do I connect them together in the simplest way possible?
With spark-shell in Scala I can simply do:
./bin/spark-shell --jars spark-cassandra-connector-java-assembly-1.6.0-M1-SNAPSHOT.jar
and then
import org.apache.spark.sql.cassandra.CassandraSQLContext
val cc = new CassandraSQLContext(sc)
cc.setKeyspace("mykeyspace")
val dataframe = cc.sql("SELECT count(*) FROM mytable group by beamstamp")
How can I do that with pyspark?
Here are a couple of subquestions along with the partial answers I have collected (correct me if I'm wrong).
Is pyspark-cassandra needed? (I don't think so; I don't understand what it was doing in the first place.)
Do I need to use pyspark or could I use my regular jupyter notebook and import the necessary things myself?
Pyspark should be started with the spark-cassandra-connector package as described in the Spark Cassandra Connector python docs.
./bin/pyspark \
  --packages com.datastax.spark:spark-cassandra-connector_$SPARK_SCALA_VERSION:$SPARK_VERSION
With this loaded, you will be able to use any of the DataFrame operations already present in Spark on C* DataFrames. More details on the options for using C* DataFrames are in the connector documentation.
To set this up to run in a Jupyter notebook, just set up your environment with the following properties:
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS=notebook
Calling pyspark will then start up a correctly configured notebook.
There is no need to use pyspark-cassandra unless you are interested in working with RDDs in Python, which has a few performance pitfalls.
In Python, the connector exposes the DataFrame API. As long as spark-cassandra-connector is available and SparkConf contains the required configuration, there is no need for additional packages. You can simply specify the format and options:
df = (sqlContext
.read
.format("org.apache.spark.sql.cassandra")
.options(table="mytable", keyspace="mykeyspace")
.load())
If you want to use plain SQL, you can register the DataFrame as follows:
df.registerTempTable("mytable")
## Optionally cache
sqlContext.cacheTable("mytable")
sqlContext.sql("SELECT count(*) FROM mytable group by beamstamp")
Advanced features of the connector, like CassandraRDD, are not exposed to Python, so if you need something beyond DataFrame capabilities, pyspark-cassandra may prove useful.

How to enable streaming from Cassandra to Spark?

I have the following Spark job:
from __future__ import print_function
import os
import sys
import time
from random import random
from operator import add
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark_cassandra import streaming, CassandraSparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("PySpark Cassandra Test")
    sc = CassandraSparkContext(conf=conf)
    stream = StreamingContext(sc, 2)
    rdd = sc.cassandraTable("keyspace2", "users").collect()
    # print(rdd)
    stream.start()
    stream.awaitTermination()
    sc.stop()
When I run this, it gives me the following error:
ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: \
No output operations registered, so nothing to execute
The shell command I run:
./bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.2.4 examples/src/main/python/test/reading-cassandra.py
Comparing Spark Streaming with Kafka, this is the line that is missing from the above code:
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", {'topic':1})
where I'm actually using createStream, but for Cassandra I can't find anything like this in the docs. How do I start streaming between Spark Streaming and Cassandra?
Versions:
Cassandra v2.1.12
Spark v1.4.1
Scala 2.10
To create a DStream out of a Cassandra table, you can use a ConstantInputDStream, providing the RDD created from the Cassandra table as input. This results in the RDD being materialized on each DStream batch interval.
Be warned that large tables or tables that continuously grow in size will negatively impact performance of your Streaming job.
See also: Reading from Cassandra using Spark Streaming for an example.
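A minimal Scala sketch of that approach, using the keyspace, table, host, and 4-second interval from the question (ConstantInputDStream is part of the Scala/Java streaming API):

import com.datastax.spark.connector._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("cassandra-constant-dstream")
  .set("spark.cassandra.connection.host", "127.0.1.1")
val ssc = new StreamingContext(conf, Seconds(4))

// RDD backed by the Cassandra table; it is re-materialized on every batch interval.
val cassandraRDD = ssc.sparkContext.cassandraTable("keyspace2", "users")
val dstream = new ConstantInputDStream(ssc, cassandraRDD)

// An output operation is required, otherwise you get the
// "No output operations registered" error from the question.
dstream.foreachRDD(rdd => println(rdd.collect().mkString("\n")))

ssc.start()
ssc.awaitTermination()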

Getting Tableau to talk to Spark and Cassandra

The DataStax spark cassandra connector is great for interacting with Cassandra through Apache Spark. With Spark SQL 1.1, we can use the thrift server to interact with Spark with Tableau. Since Tableau can talk to Spark, and Spark can talk to Cassandra, there's surely some way to get Tableau talking to Cassandra through Spark (or rather Spark SQL). I can't figure out how to get this running. Ideally, I'd like to do this with Spark Standalone cluster + a cassandra cluster (i.e. without additional hadoop set up). Is this possible? Any pointers are appreciated.
HiveThriftServer2 has a startWithContext(sqlContext) option, so you could create your sqlContext referencing C* and the appropriate table/CF and then pass that context to the thrift server.
So something like this:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.catalyst.types._
import java.sql.Date
val sparkContext = sc
import sparkContext._
val sqlContext = new HiveContext(sparkContext)
import sqlContext._
makeRDD((1,"hello") :: (2,"world") ::Nil).toSchemaRDD.cache().registerTempTable("t")
import org.apache.spark.sql.hive.thriftserver._
HiveThriftServer2.startWithContext(sqlContext)
So instead of starting the default thrift server from Spark, you can just launch your custom one.
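For example, something along these lines (a sketch using the Spark 1.3+ DataFrame API of the connector; keyspace and table names are placeholders) would register a Cassandra-backed temp table and hand the context to the thrift server:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val sqlContext = new HiveContext(sc)

// Register a temp table backed by a Cassandra table via the spark-cassandra-connector.
sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "mytable", "keyspace" -> "mykeyspace"))
  .load()
  .registerTempTable("mytable")

// Tableau can now query "mytable" over the thrift server's JDBC/ODBC endpoint.
HiveThriftServer2.startWithContext(sqlContext)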
