How to enable backpressure in Spark Streaming (using pyspark) - apache-spark

I would like to know the correct way to enable backpressure in Spark Streaming through PySpark. It looks like too many messages are sent from Kafka in a short time and the job is overwhelmed by them. Below is my Spark Streaming code. Can anyone point me to the correct place to enable backpressure?
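import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext(appName="PythonStreamingDirectKafka")
ssc = StreamingContext(sc, 5)
ssc.checkpoint("/spark_check/")
kvs = KafkaUtils.createDirectStream(ssc, [kafka_topic],
                                    {"metadata.broker.list": bootstrap_servers_ipaddress})
# Each record is a (key, value) tuple; tuple-unpacking lambdas are Python 2 only, so index instead.
parsed_msg = kvs.map(lambda kv: json.loads(kv[1]))
## do something below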

Here is how I set backpressure in my Kafka streaming code.
Hope it helps.
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("PythonStreamingDirectKafka") \
    .set("spark.streaming.backpressure.enabled", "true") \
    .set("spark.streaming.backpressure.initialRate", "500")
sc = SparkContext(conf=conf)
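With the direct Kafka stream it can also help to put a hard ceiling on each batch. The sketch below is not part of the answer above: it adds spark.streaming.kafka.maxRatePerPartition (a standard Spark Streaming setting, in records per second per Kafka partition) and works out the resulting per-batch cap. The helper names, the rate of 100, and the partition count of 4 are assumptions for illustration; only the 5 second batch interval comes from the question.

```python
def backpressure_settings(max_rate_per_partition, initial_rate=None):
    """Return (key, value) pairs that could be passed to SparkConf().setAll(...)."""
    pairs = [
        ("spark.streaming.backpressure.enabled", "true"),
        # Upper bound in records per second for each Kafka partition (direct stream).
        ("spark.streaming.kafka.maxRatePerPartition", str(max_rate_per_partition)),
    ]
    if initial_rate is not None:
        pairs.append(("spark.streaming.backpressure.initialRate", str(initial_rate)))
    return pairs


def max_records_per_batch(max_rate_per_partition, num_partitions, batch_seconds):
    """Largest number of records one batch can contain under the per-partition cap."""
    return max_rate_per_partition * num_partitions * batch_seconds


# Hypothetical numbers: 100 records/s per partition, 4 partitions, 5 s batches.
settings = backpressure_settings(max_rate_per_partition=100, initial_rate=500)
cap = max_records_per_batch(100, num_partitions=4, batch_seconds=5)
print(cap)  # 2000
```

In a real job the pairs would be applied with SparkConf().setAll(settings) before the SparkContext is created. Backpressure then adjusts the ingestion rate below that ceiling based on how long previous batches took.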

Related

PySpark Cassandra Database Connection Problem

I am trying to use Cassandra with PySpark. I can make a remote connection to the Spark server properly, but I run into trouble when reading a Cassandra table. I tried all of the DataStax connectors and changed the Spark configs (cores, memory, etc.), but I couldn't get it to work. (The commented-out lines in the code below are my attempts.)
Here is my Python code:
import os
# Raw strings keep the Windows backslashes from being read as escape sequences.
os.environ['JAVA_HOME'] = r"C:\Program Files\Java\jdk1.8.0_271"
os.environ['HADOOP_HOME'] = r"E:\etc\spark-3.0.1-bin-hadoop2.7"
os.environ['PYSPARK_DRIVER_PYTHON']="/usr/local/bin/python3.7"
os.environ['PYSPARK_PYTHON']="/usr/local/bin/python3.7"
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=XX.XX.XX.XX spark.cassandra.auth.username=username spark.cassandra.auth.password=passwd pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars .ivy2\jars\spark-cassandra-connector-driver_2.12-3.0.0-alpha2.jar pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 pyspark-shell'
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setMaster("spark://YY.YY.YY:7077").setAppName("My app")
conf.set("spark.shuffle.service.enabled", "false")
conf.set("spark.dynamicAllocation.enabled","false")
conf.set("spark.executor.cores", "2")
conf.set("spark.executor.memory", "5g")
conf.set("spark.executor.instances", "1")
conf.set("spark.jars", "C:\\Users\\verianalizi\\.ivy2\\jars\\spark-cassandra-connector_2.12-3.0.0-beta.jar")
conf.set("spark.cassandra.connection.host","XX.XX.XX.XX")
conf.set("spark.cassandra.auth.username","username")
conf.set("spark.cassandra.auth.password","passwd")
conf.set("spark.cassandra.connection.port", "9042")
# conf.set("spark.sql.catalog.myCatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
sc = SparkContext(conf=conf)
# sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
list_p = [('John',19),('Smith',29),('Adam',35),('Henry',50)]
rdd = sc.parallelize(list_p)
ppl = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
DF_ppl = sqlContext.createDataFrame(ppl)
# It works well until now
def load_and_get_table_df(keys_space_name, table_name):
    table_df = sqlContext.read \
        .format("org.apache.spark.sql.cassandra") \
        .option("keyspace", keys_space_name) \
        .option("table", table_name) \
        .load()
    return table_df
movies = load_and_get_table_df("weather", "currentweatherconditions")
The error I get is;
Does anyone have any idea about this?
This happens because you're specifying only the spark.jars property and pointing it to a single jar. But the Spark Cassandra Connector depends on a number of additional jars that aren't included in that list. I recommend instead either using spark.jars.packages with the coordinate com.datastax.spark:spark-cassandra-connector_2.12:3.0.0, or specifying in spark.jars the path to the assembly jar that has all the necessary dependencies.
By the way, 3.0 was released several months ago. Why are you still using a beta?
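As a rough sketch of the first option, the snippet below builds a PYSPARK_SUBMIT_ARGS string that pulls the connector and its transitive dependencies via --packages and passes every Spark property with its own --conf flag. The helper name build_submit_args is hypothetical, the coordinate is the one recommended above, and the host value is the placeholder from the question. It assumes property values contain no spaces.

```python
import os


def build_submit_args(packages, conf):
    """Assemble a PYSPARK_SUBMIT_ARGS value from Maven coordinates and Spark properties."""
    parts = ["--packages", ",".join(packages)]
    for key, value in conf.items():
        # Each property needs its own --conf flag.
        parts += ["--conf", f"{key}={value}"]
    # PySpark expects the argument string to end with "pyspark-shell".
    parts.append("pyspark-shell")
    return " ".join(parts)


submit_args = build_submit_args(
    ["com.datastax.spark:spark-cassandra-connector_2.12:3.0.0"],
    {"spark.cassandra.connection.host": "XX.XX.XX.XX"},  # placeholder host
)
# Must be set before the first SparkContext or SparkSession is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
print(submit_args)
```

The same coordinate can instead be set through conf.set("spark.jars.packages", ...) on the SparkConf, which avoids the environment variable altogether.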

How to create a Spark context in Java with Spark 1.6.3 in order to convert a file into a DataFrame

I have tried it like this, but with no luck. file1 and file2 are on my local machine, not in HDFS. Please help.
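import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.SQLContext;
SparkConf sparkConf = new SparkConf().setAppName("sample");
SparkContext sc = new SparkContext(sparkConf);
SQLContext sqlContext = SQLContext.getOrCreate(sc);
// "val" is Scala syntax; in Java the RDD type has to be declared explicitly.
RDD<String> file1 = sc.textFile("file1.txt", minPartitions);
RDD<String> file2 = sc.textFile("file2.txt", minPartitions);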

PySpark + jupyter notebook

I am trying to configure a Spark context in my notebook, but something is wrong. I do:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
if sc==sc:
    sc.stop()
if spark==spark:
    spark.stop()
conf = SparkConf()
conf = conf.setAppName(appName)
conf = conf.set("spark.master", master)
conf = conf.set("spark.python.worker.memory", "1042M")
spark.stop()
session_builder = SparkSession.builder
session_builder = session_builder.master(master)
spark = session_builder.getOrCreate()
and this gives me an error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Can we change the Spark configuration in a Jupyter notebook?
And how?
I am on the latest version of Spark with a standalone cluster.
Following the proposed action, I did:
which seems to mean the Spark context has been recreated, but the SparkSession is no longer linked to the new sc.
Just use the config option when building the SparkSession (as of 2.4):
MAX_MEMORY = "5g"
spark = SparkSession \
    .builder \
    .appName("Foo") \
    .config("spark.executor.memory", MAX_MEMORY) \
    .config("spark.driver.memory", MAX_MEMORY) \
    .getOrCreate()
From the code above, what I understand is that sc is your SparkContext and spark is your SparkSession variable. You are stopping both of them and then calling spark.stop() again on an already terminated session. Instead, use this:
from pyspark import SparkConf, SparkContext
sc.stop()
conf = (SparkConf()
        .setMaster("local")
        .setAppName("App_name")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)
You can find the documentation here: PySpark
If you have configured your notebook with PySpark, you don't need to stop the Spark context and create a new one. Instead, you can use sc as your Spark context. You can pass additional configuration via spark-submit as command-line arguments. You can refer to the configuration documentation here: PySpark Configuration

Seek to Beginning of Kafka Topic Using PySpark

Using a Kafka Stream in PySpark, is it possible to seek to the beginning of a Kafka topic without creating a new consumer group?
For example, I have the following code snippet:
...
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 pyspark-shell'
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext('local[2]', appName="MyStreamingApp_01")
sc.setLogLevel("INFO")
ssc = StreamingContext(sc, 30)
spark = SparkSession(sc)
kafkaStream = KafkaUtils.createStream(ssc, zookeeper_ip, 'group-id', {'messages': 1})
counted = kafkaStream.count()
...
My goal is to do something along the lines of
kafkaStream.seekToBeginningOfTopic()
Currently, I'm creating a new consumer group to re-read from the beginning of the topic, e.g.:
kafkaStream = KafkaUtils.createStream(ssc, zookeeper, 'group-id-2', {'messages': 1}, {"auto.offset.reset": "smallest"})
Is this the proper way to consume a topic from the beginning using PySpark?

Kafka integration with Spark

I want to set up a streaming application using Apache Kafka and Spark Streaming. Kafka (version 0.9.0.1) is running on a separate Unix machine, and Spark v1.6.1 is part of a Hadoop cluster.
I have started the ZooKeeper and Kafka servers and want to stream in messages from a log file using the console producer, to be consumed by a Spark Streaming application using the direct method (no receivers). I have written the code in Python and am executing it with the command below:
spark-submit --jars spark-streaming-kafka-assembly_2.10-1.6.1.jar streamingDirectKafka.py
I get the error below:
/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 152, in createDirectStream
py4j.protocol.Py4JJavaError: An error occurred while calling o38.createDirectStreamWithoutMessageHandler.
: java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
Could you please help?
Thanks!!
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
    conf = SparkConf().setAppName("StreamingDirectKafka")
    sc = SparkContext(conf = conf)
    ssc = StreamingContext(sc, 1)
    topic = ['test']
    kafkaParams = {"metadata.broker.list": "apsrd7102:9092"}
    lines = (KafkaUtils.createDirectStream(ssc, topic, kafkaParams)
             .map(lambda x: x[1]))
    counts = (lines.flatMap(lambda line: line.split(" "))
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a+b))
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
It looks like you are using an incompatible version of Kafka. According to the documentation, as of Spark 2.0, Kafka 0.8.x is supported.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
