StructuredStreaming with ForEachWriter creating duplicates - apache-spark

Hi, I'm trying to create a Neo4j sink using PySpark and Kafka, but for some reason this sink is creating duplicates in Neo4j and I'm not sure why. I expect to get only one node, but it looks like it's creating 4. If someone has an idea, please let me know.
Kafka producer code:
from kafka import KafkaProducer
import json
producer = KafkaProducer(bootstrap_servers='10.0.0.38:9092')
message = {
    'test_1': 'test_1',
    'test_2': 'test_2'
}
producer.send('test_topic', json.dumps(message).encode('utf-8'))
producer.close()
Kafka consumer code:
from kafka import KafkaConsumer
import findspark
from py2neo import Graph
import json
findspark.init()
from pyspark.sql import SparkSession
class ForeachWriter:
    def open(self, partition_id, epoch_id):
        neo4j_uri = ''  # neo4j uri
        neo4j_auth = ('', '')  # neo4j user, password
        self.graph = Graph(neo4j_uri, auth=neo4j_auth)
        return True

    def process(self, msg):
        msg = json.loads(msg.value.decode('utf-8'))
        self.graph.run("CREATE (n: MESSAGE_RECEIVED) SET n.key = '" + str(msg).replace("'", '"') + "'")
        raise KeyError('received message: {}. finished creating node'.format(msg))

spark = SparkSession.builder.appName('test-consumer') \
    .config('spark.executor.instances', 1) \
    .getOrCreate()

ds1 = spark.readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', '10.0.0.38:9092') \
    .option('subscribe', 'test_topic') \
    .load()

query = ds1.writeStream.foreach(ForeachWriter()).start()
query.awaitTermination()
[Image: neo4j graph after running the code]

After doing some searching, I found this snippet of text in Stream Processing with Apache Spark: Mastering Structured Streaming and Spark Streaming, chapter 11, p. 151, right after the description of open, process, and close for ForeachWriter:
This contract is part of the data delivery semantics because it allows us to remove duplicated partitions that might already have been sent to the sink but are reprocessed by Structured Streaming as part of a recovery scenario. For that mechanism to properly work, the sink must implement some persistent way to remember the partition/version combinations that it has already seen.
On another note, from the Spark website: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html (see the section on Foreach):
Note: Spark does not guarantee same output for (partitionId, epochId), so deduplication cannot be achieved with (partitionId, epochId). e.g. source provides different number of partitions for some reasons, Spark optimization changes number of partitions, etc. See SPARK-28650 for more details. If you need deduplication on output, try out foreachBatch instead.
It seems like I need to implement a uniqueness check in the sink, because Structured Streaming can reprocess partitions after a failure when I use ForeachWriter; otherwise I have to switch to foreachBatch instead.
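One way to make the ForeachWriter itself tolerant of reprocessing, without switching to foreachBatch, is to make the write idempotent. The sketch below is not from the original post: it assumes the Kafka topic/partition/offset of each record can serve as a unique key, and it uses Cypher MERGE instead of CREATE so a replayed record matches the existing node rather than creating a new one. (Note also that the raise KeyError in process above makes every task fail, and Spark retries failed tasks, which by itself likely explains the four nodes seen here.)
from py2neo import Graph
import json

class IdempotentForeachWriter:
    def open(self, partition_id, epoch_id):
        neo4j_uri = ''  # neo4j uri
        neo4j_auth = ('', '')  # neo4j user, password
        self.graph = Graph(neo4j_uri, auth=neo4j_auth)
        return True

    def process(self, msg):
        payload = json.loads(msg.value.decode('utf-8'))
        # MERGE on the Kafka coordinates: reprocessing the same record matches
        # the existing node instead of creating a duplicate.
        self.graph.run(
            "MERGE (n:MESSAGE_RECEIVED {topic: $topic, partition: $partition, offset: $offset}) "
            "SET n.key = $payload",
            topic=msg.topic, partition=msg.partition, offset=msg.offset,
            payload=json.dumps(payload))

    def close(self, error):
        pass
With foreachBatch you would instead deduplicate the batch DataFrame (for example with dropDuplicates on the topic, partition, and offset columns) before writing it out.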

Related

Databricks spark best practice for writing a structured stream to a lot of sinks?

I'm using Databricks Spark 3.x and reading a very large number of streams (100+); each stream has its own contract and needs to be written out to its own delta/parquet/sql/whatever table. While this is a lot of streams, the activity per stream is low: some streams might see only hundreds of records a day. I do want to stream because I am aiming for a fairly low-latency approach.
Here's what I'm talking about (code abbreviated for simplicity; I'm using checkpoints, output modes, etc. correctly).
Assume a schemas variable contains the schema for each topic. I've tried this approach, where I create a ton of individual streams, but it takes a lot of compute and most of it is wasted:
def batchprocessor(topic, schema):
    def F(df, batchId):
        sql = f'''
        MERGE INTO SOME TABLE
        USING SOME MERGE TABLE ON SOME CONDITION
        WHEN MATCHED
            UPDATE SET *
        WHEN NOT MATCHED
            INSERT *
        '''
        df.createOrReplaceTempView(f"SOME MERGE TABLE")
        df._jdf.sparkSession().sql(sql)
    return F
for topic in topics:
    query = (spark
        .readStream
        .format("delta")
        .load(f"/my-stream-one-table-per-topic/{topic}")
        .withColumn('json', from_json(col('value'), schemas[topic]))
        .select(col('json.*'))
        .writeStream
        .format("delta")
        .foreachBatch(batchprocessor(topic, schemas[topic]))
        .start())
I also tried to create just one stream that did a ton of filtering, but performance was pretty abysmal even in a test environment where I pushed a single message to a single topic:
def batchprocessor(df, batchId):
    df.cache()
    for topic in topics:
        filteredDf = (df.filter(f"topic == '{topic}'")
            .withColumn('json', from_json(col('value'), schemas[topic]))
            .select(col('json.*')))
        sql = f'''
        MERGE INTO SOME TABLE
        USING SOME MERGE TABLE ON SOME CONDITION
        WHEN MATCHED
            UPDATE SET *
        WHEN NOT MATCHED
            INSERT *
        '''
        filteredDf.createOrReplaceTempView(f"SOME MERGE TABLE")
        filteredDf._jdf.sparkSession().sql(sql)
    df.unpersist()
query = (spark
    .readStream
    .format("delta")
    .load(f"/my-stream-all-topics-in-one-but-partitioned")
    .writeStream
    .format("delta")
    .foreachBatch(batchprocessor)
    .start())
Is there any good way to essentially demultiplex a stream like this? It's already partitioned, so I assume the query planner isn't doing too much redundant work, but it seems like there's a huge amount of overhead nonetheless.
I ran a bunch of benchmarks, and option 2 is more efficient. I don't entirely know why yet.
Ultimately, performance still wasn't what I wanted - each topic runs in order, no matter the size, so a single record on each topic would lead the FIFO scheduler to queue up a lot of very inefficient small operations. I solved that using parallelisation:
import threading

def writeTable(table, df, poolId, sc):
    sc.setLocalProperty("spark.scheduler.pool", poolId)
    df.write.mode('append').format('delta').saveAsTable(table)
    sc.setLocalProperty("spark.scheduler.pool", None)

def processBatch(df, batchId):
    df.cache()
    dfsToWrite = {}
    for row in df.select('table').distinct().collect():
        table = row.table
        filteredDf = df.filter(f"table = '{table}'")
        dfsToWrite[table] = filteredDf
    threads = []
    # use a separate name for the per-table dataframe so df.unpersist() below
    # still refers to the cached batch dataframe
    for table, tableDf in dfsToWrite.items():
        threads.append(threading.Thread(target=writeTable, args=(table, tableDf, table, spark.sparkContext)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    df.unpersist()
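One thing the threaded version above relies on (not shown in the snippet) is that scheduler pools only take effect when Spark runs with the FAIR scheduler; under the default FIFO scheduler the spark.scheduler.pool property is ignored. A minimal sketch, assuming you can set the config when the session is built ('demux-streams' is a hypothetical app name):
from pyspark.sql import SparkSession

# Enable the FAIR scheduler so writes submitted from different pools can run concurrently
spark = (SparkSession.builder
    .appName('demux-streams')
    .config('spark.scheduler.mode', 'FAIR')
    .getOrCreate())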

How do I consume a Kafka topic inside a Spark Streaming app?

When I create a stream from a Kafka topic and print its content
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext(appName="PythonStreamingKafkaWords")
ssc = StreamingContext(sc, 10)
lines = KafkaUtils.createDirectStream(ssc, ['sample_topic'], {"bootstrap.servers": 'localhost:9092'})
lines.pprint()
ssc.start()
ssc.awaitTermination()
I get an empty result
-------------------------------------------
Time: 2019-12-07 13:11:50
-------------------------------------------
-------------------------------------------
Time: 2019-12-07 13:12:00
-------------------------------------------
-------------------------------------------
Time: 2019-12-07 13:12:10
-------------------------------------------
Meanwhile, it works in the console:
kafka-console-consumer --topic sample_topic --from-beginning --bootstrap-server localhost:9092
correctly gives me all lines of my text in Kafka topic:
ham Ok lor... Sony ericsson salesman... I ask shuhui then she say quite gd 2 use so i considering...
ham Ard 6 like dat lor.
ham Why don't you wait 'til at least wednesday to see if you get your .
ham Huh y lei...
spam REMINDER FROM O2: To get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode
spam This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.
ham Will ü b going to esplanade fr home?
. . .
What is the proper way to stream data from a Kafka topic into a Spark Streaming app?
Based on your code, we can't print the streaming RDD directly; printing should be done through foreachRDD. DStream.foreachRDD is an "output operator" in Spark Streaming: it allows you to access the underlying RDDs of the DStream and run actions that do something practical with the data.
See also: What's the meaning of DStream.foreachRDD function?
Note: you can also achieve this with Structured Streaming. Ref: Pyspark Structured streaming processing
Sample working code: this code tries to read messages from a Kafka topic and print them. You can change it based on your requirements.
from pyspark import SparkConf, SparkContext
from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json

def handler(message):
    records = message.collect()
    for record in records:
        print(record[1])

def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 10)
    # values are decoded as UTF-8 strings by default
    kvs = KafkaUtils.createDirectStream(ssc, ['topic_name'], {"metadata.broker.list": 'localhost:9192'})
    kvs.foreachRDD(handler)
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
The reason you are not seeing any data in the streaming output is that Spark Streaming starts reading from the latest offset by default. So if you start your Spark Streaming application first and then write data to Kafka, you will see output in the streaming job. Refer to the documentation here:
By default, it will start consuming from the latest offset of each Kafka partition
But you can also read data from any specific offset of your topic. Take a look at the createDirectStream method here: it takes a fromOffsets parameter where you specify the starting offset per partition as a dictionary.
I have tested the code below with Kafka 2.2.0, Spark 2.4.3, and Python 3.7.3:
Start pyspark shell with kafka dependencies:
pyspark --packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.4.0
Run below code:
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 1)
topicPartion = TopicAndPartition('test',0)
fromOffset = {topicPartion: 0}
lines = KafkaUtils.createDirectStream(ssc, ['test'],{"bootstrap.servers": 'localhost:9092'}, fromOffsets=fromOffset)
lines.pprint()
ssc.start()
ssc.awaitTermination()
You should also consider using Structured Streaming instead of Spark Streaming if your Kafka broker is version 0.10 or higher. Refer to the Structured Streaming documentation here and Structured Streaming with Kafka integration here.
Below is sample code to run with Structured Streaming.
Please use the jar version that matches your Kafka and Spark versions.
I am using Spark 2.4.3 with Scala 2.11 and Kafka 0.10, so the jar is spark-sql-kafka-0-10_2.11:2.4.3.
Start pyspark shell:
pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .option("startingOffsets", "earliest") \
    .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()
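If you run this as a standalone script rather than inside the pyspark shell, keep a handle on the query and block on it, otherwise the driver exits immediately (a small variation of the snippet above):
query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()

query.awaitTermination()  # blocks until the stream is stopped or fails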
I recommend using Spark Structured Streaming. It's the new-generation streaming engine that came with the release of Spark 2. You can check it out at this link.
For Kafka integration, you can look at the docs at this link.

Spark dataframe loses streaming capability after accessing Kafka source

I use Spark 2.4.3 and Kafka 2.3.0. I want to do Spark Structured Streaming with data coming from Kafka to Spark. In general it works in test mode, but since I have to do some processing of the data (and do not know another way to do it), the Spark dataframes no longer have streaming capability.
#!/usr/bin/env python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
# create schema for data
schema = StructType([StructField("Signal", StringType()),StructField("Value", DoubleType())])
# create spark session
spark = SparkSession.builder.appName("streamer").getOrCreate()
# create DataFrame representing the stream
dsraw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "test") \
    .load()
print("dsraw.isStreaming: ", dsraw.isStreaming)
# Convert Kafka stream to something readable
ds = dsraw.selectExpr("CAST(value AS STRING)")
print("ds.isStreaming: ", ds.isStreaming)
# Do query on the converted data
dsQuery = ds.writeStream.queryName("ds_query").format("memory").start()
df1 = spark.sql("select * from ds_query")
print("df1.isStreaming: ", df1.isStreaming)
# convert json into spark dataframe cols
df2 = df1.withColumn("value", from_json("value", schema))
print("df2.isStreaming: ", df2.isStreaming)
The output is:
dsraw.isStreaming: True
ds.isStreaming: True
df1.isStreaming: False
df2.isStreaming: False
So I lose the streaming capability when I create the first dataframe. How can I avoid it? How do I get a streaming Spark dataframe out of a stream?
It is not recommended to use the memory sink for production applications, as all the data will be stored in the driver.
There is also no reason to do this except for debugging purposes, as you can process streaming dataframes like 'normal' dataframes. For example:
import pyspark.sql.functions as F
lines = spark.readStream.format("socket").option("host", "XXX.XXX.XXX.XXX").option("port", XXXXX).load()
words = lines.select(lines.value)
words = words.filter(words.value.startswith('h'))
wordCounts = words.groupBy("value").count()
wordCounts = wordCounts.withColumn('count', F.col('count') + 2)
query = wordCounts.writeStream.queryName("test").outputMode("complete").format("memory").start()
In case you still want to go with your approach: Even if df.isStreaming tells you it is not a streaming dataframe (which is correct), the underlying datasource is a stream and the dataframe will therefore grow with each processed batch.
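Concretely for this question, the from_json parsing can be applied to the streaming dataframe itself before any sink is started, so the memory table is not needed at all. A minimal sketch reusing the schema and dsraw from the question (assuming dsraw was created with .load() as above):
from pyspark.sql.functions import from_json, col

# Parse the Kafka value while the dataframe is still a stream
parsed = dsraw.selectExpr("CAST(value AS STRING) AS value") \
    .withColumn("value", from_json(col("value"), schema)) \
    .select(col("value.Signal").alias("Signal"), col("value.Value").alias("Value"))

print("parsed.isStreaming: ", parsed.isStreaming)  # True

query = parsed.writeStream.queryName("parsed_query").format("console").start()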

Call a function with each element of a stream in Databricks

I have a DataFrame stream in Databricks, and I want to perform an action on each element. On the net I found special-purpose methods, like writing it to the console or dumping it into memory, but I want to add some business logic and put some results into Redis.
To be more specific, this is how it would look in the non-stream case:
val someDataFrame = Seq(
  ("key1", "value1"),
  ("key2", "value2"),
  ("key3", "value3"),
  ("key4", "value4")
).toDF()

def someFunction(keyValuePair: (String, String)) = {
  println(keyValuePair)
}

someDataFrame.collect.foreach(r => someFunction((r(0).toString, r(1).toString)))
But if the someDataFrame is not a simple data frame but a stream data frame (indeed coming from Kafka), the error message is this:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Could anyone please help me solve this problem?
Some important notes:
I've read the relevant documentation, like Spark Streaming or Databricks Streaming and a few other descriptions as well.
I know that there must be something like start() and awaitTermination, but I don't know the exact syntax. The descriptions did not help.
It would take pages to list all the possibilities I tried, so I'd rather not provide them.
I do not want to solve the specific problem of displaying the result. I.e. please do not provide a solution to this specific case. The someFunction would look like this:
val someData = readSomeExternalData()
if (condition containing keyValuePair and someData) {
  doSomething(keyValuePair);
}
(Question What is the purpose of ForeachWriter in Spark Structured Streaming? does not provide a working example, therefore does not answer my question.)
Here is an example using foreachBatch to save every item to Redis with the streaming API.
Related to a previous question (DataFrame to RDD[(String, String)] conversion).
// import spark and spark-redis
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.streaming._
import org.apache.spark.sql.types._
import com.redislabs.provider.redis._

// schema of csv files
val userSchema = new StructType()
  .add("name", "string")
  .add("age", "string")

// create a data stream reader from a dir with csv files
val csvDF = spark
  .readStream
  .format("csv")
  .option("sep", ";")
  .schema(userSchema)
  .load("./data") // directory where the CSV files are

// redis
val redisConfig = new RedisConfig(new RedisEndpoint("localhost", 6379))
implicit val readWriteConfig: ReadWriteConfig = ReadWriteConfig.Default

csvDF.map(r => (r.getString(0), r.getString(1))) // converts the dataset to a Dataset[(String, String)] of (name, age)
  .writeStream // create a data stream writer
  .foreachBatch((df, _) => sc.toRedisKV(df.rdd)(redisConfig)) // save each batch to redis after converting it to an RDD
  .start // start processing
Call a simple user-defined function with foreachBatch in Spark Structured Streaming.
Please try this; it will print 'hello world' for every message from the TCP socket.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
def process_row(df, epoch_id):
    # Write row to storage
    print('hello world')

query = words.writeStream.foreachBatch(process_row).start()
# query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()

How to transform structured streams with PySpark?

This seems like it should be obvious, but in reviewing the docs and examples, I'm not sure I can find a way to take a structured stream and transform it using PySpark.
For example:
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .appName('StreamingWordCount')
    .getOrCreate()
)

raw_records = (
    spark
    .readStream
    .format('socket')
    .option('host', 'localhost')
    .option('port', 9999)
    .load()
)

# I realize there's a SQL function for upper-case, just illustrating a sample
# use of an arbitrary map function
records = raw_records.rdd.map(lambda w: w.upper()).toDF()

counts = (
    records
    .groupBy(records.value)
    .count()
)

query = (
    counts
    .writeStream
    .outputMode('complete')
    .format('console')
    .start()
)

query.awaitTermination()
This will throw the following exception:
Queries with streaming sources must be executed with writeStream.start
However, if I remove the call to rdd.map(...).toDF() things seem to work fine.
It seems as though the call to rdd.map branches execution off from the streaming context and causes Spark to warn that it was never started?
Is there a "right" way to apply map or mapPartition style transformations using Structured Streaming and PySpark?
Every transformation applied in Structured Streaming has to be fully contained in the Dataset world; in the case of PySpark this means you can use only DataFrames or SQL, and conversion to RDDs (or DStreams or local collections) is not supported.
If you want to use plain Python code, you have to use a UserDefinedFunction.
from pyspark.sql.functions import udf

@udf
def to_upper(s):
    return s.upper()

raw_records.select(to_upper("value"))
See also Spark Structured Streaming and Spark-Ml Regression
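If you are on Spark 3.0+ with PyArrow installed, a pandas_udf is a vectorized alternative to the plain UDF above (this is an addition, not part of the original answer):
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def to_upper_vec(s: pd.Series) -> pd.Series:
    # Operates on a whole pandas Series per batch instead of one row at a time
    return s.str.upper()

records = raw_records.select(to_upper_vec("value").alias("value"))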
Another way for a specific column (column_name):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def to_upper(string):
    return string.upper()

to_upper_udf = udf(to_upper, StringType())

records = raw_records.withColumn("new_column_name",
                                 to_upper_udf(raw_records['column_name'])) \
    .drop("column_name")
