Write dataframe to kafka pyspark - apache-spark

I have a Spark dataframe which I would like to write to Kafka. I have tried the snippet below:
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers = util.get_broker_metadata())
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
for row in df.rdd.collect():
    producer.send('topic', str(row.asDict()))
    producer.flush()
This works, but the problem with this snippet is that it is not scalable: every time collect runs, the data is aggregated on the driver node, which can slow down all operations.
Since a foreach operation on a dataframe can run in parallel on the worker nodes, I tried the approach below.
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers = util.get_broker_metadata())
df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ('k', 'v'))
def custom_fun(row):
    producer.send('topic', str(row.asDict()))
    producer.flush()
df.foreach(custom_fun)
This doesn't work and gives a pickling error: PicklingError: Cannot pickle objects of type <type 'itertools.count'>. I am not able to understand the reason behind this error. Can anyone help me understand it or provide any other parallel solution?

The error you get looks unrelated to Kafka writes. It looks like somewhere else in your code you use itertools.count (AFAIK it is not used in Spark's source at all; it is of course possible that it comes with KafkaProducer), which for some reason is serialized with the cloudpickle module. Changing the Kafka writing code might have no impact at all. If KafkaProducer is the source of the error, you should be able to resolve this with foreachPartition:
from kafka import KafkaProducer
def send_to_kafka(rows):
    producer = KafkaProducer(bootstrap_servers=util.get_broker_metadata())
    for row in rows:
        producer.send('topic', str(row.asDict()))
    producer.flush()

df.foreachPartition(send_to_kafka)
That being said:
or provide any other parallel solution?
I would recommend using the built-in Kafka data source instead. Include the Kafka SQL package, for example:
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
And:
from pyspark.sql.functions import to_json, col, struct
(df
    .select(to_json(struct([col(c).alias(c) for c in df.columns])).alias("value"))
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)
    .option("topic", topic)
    .save())

Related

Jobs are not shown on Spark WebUI

I am a new user of Spark. I installed Spark, used Anaconda to install PySpark, and then ran the basic code given below in a Jupyter notebook. I then open the Spark WebUI, but I am unable to see any jobs, either running or completed. Any comments are appreciated.
from pyspark.sql import SparkSession
spark = SparkSession.builder\
    .master("local")\
    .appName("NQlabtop")\
    .config('spark.ui.port', '4050')\
    .getOrCreate()
sc = spark.sparkContext
input_file=sc.textFile("C:/Users/nqazi/NQ/anscombe.json")
map = input_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
counts = map.reduceByKey(lambda a, b: a + b)
print("counts",counts)
sc = spark.sparkContext
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
Please see the image of the Spark WebUI below. I am not sure why I cannot see any of the jobs, as I think it should display the completed jobs.
There are two types of functions in PySpark (Spark): transformations and actions. Transformations are lazily evaluated, and PySpark doesn't run any jobs until you call an action function such as show, count, or collect.
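For example, here is a minimal sketch (reusing the names from the snippet in the question) of what would actually make jobs appear in the WebUI:
# flatMap, map, reduceByKey and parallelize are transformations: they only build the
# lineage. An action such as collect(), count() or sum() is what submits a job.
word_counts = counts.collect()   # action: a job now shows up in the WebUI
print("counts", word_counts)

total = distData.sum()           # another action, another job
print("sum", total)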

Spark: 'writeStream' can be called only on streaming Dataset/DataFrame

I'm trying to retrieve tweets from my Kafka cluster into Spark Streaming, where I perform some analysis before storing them in an Elasticsearch index.
Versions:
Spark - 2.3.0
Pyspark - 2.3.0
Kafka - 2.3.0
Elastic Search - 7.9
Elastic Search Hadoop - 7.6.2
I run the following code in my Jupyter environment to write the streaming dataframe into Elasticsearch.
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0,org.elasticsearch:elasticsearch-hadoop:7.6.2 pyspark-shell'

from pyspark import SparkContext
# Spark Streaming
from pyspark.streaming import StreamingContext
# Kafka
from pyspark.streaming.kafka import KafkaUtils
# json parsing
import json
import nltk
import logging
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def getSqlContextInstance(sparkContext):
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = SQLContext(sparkContext)
    return globals()['sqlContextSingletonInstance']

def analyze_sentiment(tweet):
    scores = dict([('pos', 0), ('neu', 0), ('neg', 0), ('compound', 0)])
    sentiment_analyzer = SentimentIntensityAnalyzer()
    score = sentiment_analyzer.polarity_scores(tweet)
    for k in sorted(score):
        scores[k] += score[k]
    return json.dumps(scores)

def process(time, rdd):
    print("========= %s =========" % str(time))
    try:
        if rdd.count() == 0:
            raise Exception('Empty')
        sqlContext = getSqlContextInstance(rdd.context)
        df = sqlContext.read.json(rdd)
        df = df.filter("text not like 'RT #%'")
        if df.count() == 0:
            raise Exception('Empty')
        udf_func = udf(lambda x: analyze_sentiment(x), returnType=StringType())
        df = df.withColumn("Sentiment", lit(udf_func(df.text)))
        print(df.take(10))
        df.writeStream.outputMode('append').format('org.elasticsearch.spark.sql').option('es.nodes', 'localhost').option('es.port', 9200)\
            .option('checkpointLocation', '/checkpoint').option('es.spark.sql.streaming.sink.log.enabled', False).start('PythonSparkStreamingKafka_RM_01').awaitTermination()
    except Exception as e:
        print(e)
        pass

sc = SparkContext(appName="PythonSparkStreamingKafka_RM_01")
sc.setLogLevel("INFO")
ssc = StreamingContext(sc, 20)
kafkaStream = KafkaUtils.createDirectStream(ssc, ['kafkaspark'], {
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'spark-streaming',
    'fetch.message.max.bytes': '15728640',
    'auto.offset.reset': 'largest'})
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
parsed.foreachRDD(process)
ssc.start()
ssc.awaitTermination(timeout=180)
But I get the error:
'writeStream' can be called only on streaming Dataset/DataFrame;
It looks like I have to use .readStream, but how do I use it to read from the Kafka stream without createDirectStream?
Could someone please help me with writing this dataframe into Elasticsearch? I am a beginner with Spark Streaming and Elasticsearch and find it quite challenging. I would be happy if someone could guide me through getting this done.
.writeStream is part of the Spark Structured Streaming API, so you need to use the corresponding API to start reading the data (spark.readStream), pass the options specific to the Kafka source that are described in the separate document, and also use the additional jar that contains the Kafka implementation. The corresponding code would look like this (the full code is here):
val streamingInputDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.168.0.10:9092")
  .option("subscribe", "tweets-txt")
  .load()
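A rough PySpark equivalent, assuming the spark-sql-kafka-0-10 package is on the classpath and reusing the broker and topic names from the question, might look like this:
# Structured Streaming read from Kafka; needs the
# org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 package instead of spark-streaming-kafka-0-8.
streaming_input_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "kafkaspark")
    .load())

# The payload arrives as a binary "value" column; cast it to a string before JSON parsing.
tweets = streaming_input_df.selectExpr("CAST(value AS STRING) AS json")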

FPGrowth: Input data is not cached pyspark

I am trying to run the following example code. Even though I have cached my data, I am getting an "Input data is not cached" warning. Because of this issue, I am not able to use the FP-growth algorithm for large datasets.
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import SparkSession

"""
An example demonstrating FPGrowth.
Run with:
  bin/spark-submit examples/src/main/python/ml/fpgrowth_example.py
"""

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("FPGrowthExample")\
        .getOrCreate()

    # $example on$
    df = spark.createDataFrame([
        (0, [1, 2, 5]),
        (1, [1, 2, 3, 5]),
        (2, [1, 2])
    ], ["id", "items"])

    df = df.cache()

    fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
    model = fpGrowth.fit(df)

    # Display frequent itemsets.
    model.freqItemsets.show()

    # Display generated association rules.
    model.associationRules.show()

    # transform examines the input items against all the association rules and summarizes the
    # consequents as prediction
    model.transform(df).show()

    spark.stop()
Why:
Because ml.fpm.FPGrowth converts the data to an RDD and runs mllib.fpm.FPGrowth on that RDD. The RDD is not cached, and this causes the warning in the mllib code.
What can you do about it:
In your code, nothing. If you think this is a big issue (it shouldn't be), open a JIRA ticket and create a pull request.
Because of this issue, I am not able to use fp growth algorithm for large datasets.
It can cause unnecessary allocation and slowdown, but it shouldn't be limiting. If you experience failures, it is possible that the parameters require tuning.
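As an illustration only (the values below are hypothetical, not a recommendation from this answer), tuning usually means raising minSupport and/or increasing numPartitions on the estimator, reusing the df from the example above:
from pyspark.ml.fpm import FPGrowth

# A higher minSupport prunes rare itemsets early, and numPartitions spreads the
# FP-tree construction over more tasks; both can help with large inputs.
fpGrowth = FPGrowth(
    itemsCol="items",
    minSupport=0.2,      # illustrative value; the right threshold depends on your data
    minConfidence=0.6,
    numPartitions=200    # illustrative value; defaults to the input's partition count
)
model = fpGrowth.fit(df)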

How can I cache DataFrame with Kryo Serializer in Spark?

I am trying to use Spark with the Kryo serializer to store some data with less memory cost. Now I have come across a problem: I cannot save a DataFrame (whose type is Dataset[Row]) in memory with the Kryo serializer. I thought all I needed to do was add org.apache.spark.sql.Row to classesToRegister, but the error still occurs:
spark-shell --conf spark.kryo.classesToRegister=org.apache.spark.sql.Row --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrationRequired=true
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.storage.StorageLevel
val schema = StructType(StructField("name", StringType, true) :: StructField("id", IntegerType, false) :: Nil)
val seq = Seq(("hello", 1), ("world", 2))
val df = spark.createDataFrame(sc.emptyRDD[Row], schema).persist(StorageLevel.MEMORY_ONLY_SER)
df.count()
Error occurs like this:
I don't think adding byte[][] to classesToRegister is a good idea. So what should I do to store a dataframe in memory with Kryo?
Datasets don't use standard serialization methods. They use specialized columnar storage with its own compression methods, so you don't need to store your Dataset with the Kryo serializer.
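In other words (a minimal PySpark sketch of the same idea, not tied to the Scala shell session above), just persist the DataFrame and let Spark use its built-in columnar cache:
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName("cache-example").getOrCreate()
df = spark.createDataFrame([("hello", 1), ("world", 2)], ["name", "id"])

# No Kryo settings needed: a cached DataFrame is stored in Spark's own
# compressed in-memory columnar format, not via Java/Kryo serialization.
df.persist(StorageLevel.MEMORY_ONLY)
df.count()   # materializes the cache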

Spark Streaming: How to load a Pipeline on a Stream?

I am implementing a lambda architecture system for stream processing.
I have no issue creating a Pipeline with GridSearch in Spark Batch:
pipeline = Pipeline(stages=[data1_indexer, data2_indexer, ..., assembler, logistic_regressor])

paramGrid = (
    ParamGridBuilder()
    .addGrid(logistic_regressor.regParam, (0.01, 0.1))
    .addGrid(logistic_regressor.tol, (1e-5, 1e-6))
    ...etcetera
).build()

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=4)

pipeline_cv = cv.fit(raw_train_df)
model_fitted = pipeline_cv.getEstimator().fit(raw_validation_df)
model_fitted.write().overwrite().save("pipeline")
However, I can't seem to find how to plug the pipeline into the Spark Streaming process. I am using Kafka as the DStream source, and my code as of now is as follows:
import json
from pyspark.ml import PipelineModel
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer", {"kafka_topic": 1})
model = PipelineModel.load('pipeline/')
parsed_stream = kafkaStream.map(lambda x: json.loads(x[1]))

CODE MISSING GOES HERE

ssc.start()
ssc.awaitTermination()
and now I need to find some way of doing this.
Based on the documentation here (even though it looks very outdated), it seems like the model needs to implement the method predict to be usable on an RDD object (and hopefully on a KafkaStream?).
How could I use the pipeline in the streaming context? The reloaded PipelineModel only seems to implement transform.
Does that mean the only way to use batch models in a streaming context is to use pure models, and no pipelines?
I found a way to load a Spark Pipeline into spark streaming.
This solution works for Spark v2.0; later versions will probably implement a better solution.
The solution I found transforms the streaming RDDs into DataFrames using the toDF() method, to which you can then apply the pipeline.transform method.
This way of doing things is horribly inefficient though.
# We load the required libraries
from pyspark.sql.types import (
    StructType, StringType, StructField, LongType
)
from pyspark.sql import Row
from pyspark.ml import PipelineModel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# We specify the dataframe's schema, so spark does not have to do reflection on the data.
pipeline_schema = StructType(
    [
        StructField("field1", StringType(), True),
        StructField("field2", StringType(), True),
        StructField("field3", LongType(), True)
    ]
)

# We load the pipeline saved with spark batch
pipeline = PipelineModel.load('/pipeline')

# Set up the usual spark context and spark streaming context
sc = spark.sparkContext
ssc = StreamingContext(sc, 1)

# In my case I use the kafka directKafkaStream as the DStream source
directKafkaStream = KafkaUtils.createDirectStream(ssc, suwanpos[QUEUE_NAME], {"metadata.broker.list": "localhost:9092"})

def handler(req_rdd):
    def process_point(p):
        # here goes the logic to run after applying the pipeline
        print(p)
    if req_rdd.count() > 0:
        # Here is the gist of it: we turn the rdd into Rows, then into a df with the specified schema
        req_df = req_rdd.map(lambda r: Row(**r)).toDF(schema=pipeline_schema)
        # Now we can apply the transform, yaaay
        pred = pipeline.transform(req_df)
        records = pred.rdd.map(lambda p: process_point(p)).collect()

# Finally, attach the handler to the DStream before starting the streaming context
directKafkaStream.foreachRDD(handler)
Hope this helps.

Resources