Spark job doesn't start consuming from huge Kafka topic - apache-spark

I am facing a supposedly simple problem that is nevertheless making me scratch my head against the wall.
I've set up a Kafka cluster (MSK in AWS) with one topic and 200 partitions; right now the topic holds about 100M events, roughly 1TB of data.
MSK is configured with 6 kafka.m5.4xlarge brokers, and this is the basic config:
log.retention.ms = 300000
message.max.bytes = 10485760
replica.fetch.max.bytes = 10485760
replica.fetch.response.max.bytes = 10485760
socket.receive.buffer.bytes = 10485760
socket.request.max.bytes = 10485760
socket.send.buffer.bytes = 10485760
I want to process these events one by one using a Spark cluster, so I have created a simple Spark job with this code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, from_json, split

from src import util, event_schema


def print_row(row):
    print(row)


if __name__ == "__main__":
    config = util.get_config()

    spark_session = SparkSession \
        .builder \
        .appName('test') \
        .getOrCreate()

    # Read from Kafka
    events_df = spark_session.readStream \
        .format('kafka') \
        .option('kafka.bootstrap.servers', config['kafka']['bootstrap_servers']) \
        .option('kafka.sasl.jaas.config', f'org.apache.kafka.common.security.scram.ScramLoginModule required username="{config["kafka"]["username"]}" password="{config["kafka"]["password"]}";') \
        .option('kafka.sasl.mechanism', 'SCRAM-SHA-512') \
        .option('kafka.security.protocol', 'SASL_SSL') \
        .option('subscribe', config['kafka']['topic']) \
        .option('groupIdPrefix', 'test') \
        .option('failOnDataLoss', 'false') \
        .load()

    # Each Kafka message value may hold several newline-separated JSON events
    events_df = events_df.selectExpr('CAST(value AS STRING) as data')
    events_df = events_df.select(explode(split(events_df.data, '\n')))
    events_df = events_df.select(from_json(col('col'), event_schema).alias('value'))
    events_df = events_df.selectExpr('value.*')

    events_df.writeStream \
        .foreach(print_row) \
        .start() \
        .awaitTermination()  # keep the driver alive while the stream runs
This simple Spark job should consume every single event and print it.
When the topic is empty it starts consuming correctly, but if I attach this consumer group to the existing topic with this amount of data, it simply doesn't start consuming at all, as if it were stuck. The same doesn't happen with a plain Kafka consumer (not using PySpark), which starts consuming correctly (even though it takes a few minutes to start).
What is wrong with my code, and how can I start consuming events from the Kafka topic straight away?
Thanks

Please follow the Spark-Kafka integration guide when reading a stream from Kafka into Spark, and explore the following options described on the same page referred to by the above link:
startingOffsets: For a streaming read this defaults to latest, which means the application will only read new events that arrive in Kafka after the Spark application is deployed. If you want to read the historic, not yet processed events, set this option to earliest. Look into checkpointing as well.
maxOffsetsPerTrigger: This is useful if you want every trigger to read only a limited number of events.
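Applied to the reader from the question, a minimal sketch (the SASL/security options are omitted for brevity, and the maxOffsetsPerTrigger value is only illustrative, not a recommendation):

# Sketch: the question's reader with the two options discussed above.
# 'earliest' makes the very first run (no checkpoint yet) start from the beginning of the topic;
# 'maxOffsetsPerTrigger' caps how many records each micro-batch pulls.
events_df = spark_session.readStream \
    .format('kafka') \
    .option('kafka.bootstrap.servers', config['kafka']['bootstrap_servers']) \
    .option('subscribe', config['kafka']['topic']) \
    .option('startingOffsets', 'earliest') \
    .option('maxOffsetsPerTrigger', '100000') \
    .option('failOnDataLoss', 'false') \
    .load()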

Related

Spark Structured Streaming Batch Query

I am new to Kafka and Spark Structured Streaming. I want to know how Spark in batch mode knows which offset to read from. If I specify "startingOffsets" as "earliest", I am only getting the latest records and not all the records in the partition. I ran the same code on 2 different clusters: cluster A (local machine) fetched 6 records, cluster B (TST cluster, very first run) fetched 1 record.
df = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", broker) \
    .option("subscribe", topic) \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .load()
I am planning to run my batch once a day. Will I get all the records from yesterday's run to the current run? Where do I see offsets and commits for batch queries?
According to the Structured Streaming + Kafka Integration Guide your offsets are stored in the provided checkpoint location that you set in the write part of your batch job.
If you do not delete the checkpoint files, the job will continue to read from Kafka where it left off. If you delete the checkpoint files or if you run the job for the very first time the job will consume messages based on the option startingOffsets.
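As a rough sketch of the pattern this answer describes, one way to get that "resume where it left off" behaviour for a daily job is a run-once streaming query whose checkpoint directory remembers the Kafka offsets between runs (broker, topic, and paths below are placeholders, not from the original post):

# Sketch: a "run once" streaming query; its checkpoint stores the Kafka offsets,
# so the next daily run resumes where this one stopped.
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", broker) \
    .option("subscribe", topic) \
    .option("startingOffsets", "earliest") \
    .load()

query = df.writeStream \
    .format("parquet") \
    .option("path", "/data/kafka-output") \
    .option("checkpointLocation", "/data/checkpoints/daily-kafka-job") \
    .trigger(once=True) \
    .start()
query.awaitTermination()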

Read latest records from Kafka using pyspark batch job

I am executing a batch job in PySpark, where Spark reads data from a Kafka topic every 5 minutes.
df = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1") \
    .option("subscribePattern", "test") \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .load()
Whenever Spark reads data from Kafka it reads all the data, including previous batches.
I want to read only the data for the current batch, i.e. the latest records which have not been read before.
Please suggest!! Thank you.
From https://spark.apache.org/docs/2.4.5/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries
For batch queries, latest (either implicitly or by using -1 in json)
is not allowed.
Using earliest means all the data is obtained again.
You will need to define the offsets explicitly every time you run, e.g.:
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
That implies you need to save the offsets processed per partition. I am looking into this myself for a project in the near future. Some items below may help:
https://medium.com/datakaresolutions/structured-streaming-kafka-integration-6ab1b6a56dd1 stating what you observe:
Create a Kafka Batch Query
Spark also provides a feature to fetch the data from Kafka in batch mode. In batch mode Spark will consume all the messages at once. Kafka in batch mode requires two important parameters, starting offsets and ending offsets; if not specified, Spark will consider the default configuration, which is:
startingOffsets — earliest
endingOffsets — latest
https://dzone.com/articles/kafka-gt-hdfss3-batch-ingestion-through-spark alludes as well to what you should do, with the following:
And, finally, save these Kafka topic endOffsets to file system – local or HDFS (or commit them to ZooKeeper). This will be used for the next run of starting the offset for a Kafka topic. Here we are making sure the job's next run will read from the offset where the previous run left off.
This blog https://dataengi.com/2019/06/06/spark-structured-streaming/ I think has the answer for saving offsets.
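A rough sketch of that bookkeeping, assuming the offsets are kept in a local JSON file (the broker list, topic name, and file path below are placeholders):

# Sketch: save the highest offset read per partition and use it as startingOffsets next run.
import json, os

brokers = "host1:port1"                     # placeholder broker list
topic = "test"                              # placeholder topic name
offsets_file = "/tmp/kafka_offsets.json"    # placeholder path for the saved offsets

# First run: read everything; later runs: resume from the saved per-partition offsets.
if os.path.exists(offsets_file):
    start_offsets = open(offsets_file).read()
else:
    start_offsets = "earliest"

df = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", brokers) \
    .option("subscribe", topic) \
    .option("startingOffsets", start_offsets) \
    .option("endingOffsets", "latest") \
    .load()

# ... process df ...

# Persist the next starting point: highest offset read per partition, plus one.
# (Simplified: assumes every partition returned at least one record in this run;
# otherwise the previously saved offsets for missing partitions must be carried over.)
rows = df.groupBy("partition").agg({"offset": "max"}).collect()
next_offsets = {str(r["partition"]): r["max(offset)"] + 1 for r in rows}
with open(offsets_file, "w") as f:
    f.write(json.dumps({topic: next_offsets}))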
Did you use a checkpoint location while writing the stream data?

Azure Event Hubs to Databricks, what happens to the dataframes in use

I've been developing a proof of concept on Azure Event Hubs, streaming JSON data to an Azure Databricks notebook using PySpark. Based on the examples I've seen, I've put together the rough code below, taking the data from the Event Hub to the Delta table I'll be using as a destination.
connectionString = "My End Point"
ehConf = {'eventhubs.connectionString' : connectionString}
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
readEventStream = df.withColumn("body", \
df["body"].cast("string")). \
withColumn("date_only", to_date(col("enqueuedTime")))
readEventStream.writeStream.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/delta/testSink/streamprocess") \
.table("testSink")
After reading around and googling: what happens to the df and readEventStream dataframes? Will they just get bigger as they retain the data, or will they be emptied during normal processing? Or are they just a temporary store before the data is dumped to the Delta table? Is there a way of setting X amount of items streamed before writing out to the Delta table?
Thanks
I carefully reviewed the description of the APIs you used in the code against the official PySpark documentation for the pyspark.sql module, and I think the ever-growing memory usage is caused by the function table(tableName), which is meant for a DataFrame, not for a streaming DataFrame.
So the table function creates the data structure to hold the streaming data in memory.
I recommend using start(path=None, format=None, outputMode=None, partitionBy=None, queryName=None, **options) to complete the stream write operation first, and then getting the table from Delta Lake again. There also seems to be no way of setting an X amount of items streamed with PySpark before writing out to the Delta table.
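A minimal sketch of that rearrangement (the Delta data path below is a placeholder, not from the original post):

# Sketch: write the stream to a Delta path with start(), then read the table from that path separately.
readEventStream.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/delta/testSink/streamprocess") \
    .start("/delta/testSink/data")

# Later (e.g. in another cell), read what has been written so far:
testSink = spark.read.format("delta").load("/delta/testSink/data")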

Printing Kafka debug message on PySpark job

Is there a way to print Kafka debug messages (I am thinking of log messages similar to librdkafka debug messages, or the kafkacat -D option) when running a PySpark job?
The issue is that I used the following code in PySpark to connect to a Kafka cluster called A; it works and prints things out to the console every time a new message comes in. But when I switched to another cluster, called B, set up the same way as cluster A, it didn't print anything to the screen when new messages came in, even though I can see the messages going through just fine using the kafkacat tool on both clusters.
consumer.py
from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

spark = SparkSession.builder.appName("KafkaConsumer").getOrCreate()
sc = spark.sparkContext
sqlc = SQLContext(sc)

hosts = "host1:9092,host2:9092,host3:9092"
topic = "myTopic"
securityProtocol = "SASL_PLAINTEXT"
saslMechanism = "PLAIN"

try:
    df = sqlc \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", hosts) \
        .option("kafka.security.protocol", securityProtocol) \
        .option("kafka.sasl.mechanism", saslMechanism) \
        .option("startingOffsets", "earliest") \
        .option("subscribe", topic) \
        .load()

    dss = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
        .writeStream.outputMode('append') \
        .format("console") \
        .start()

    dss.awaitTermination()
except KeyboardInterrupt:
    print('shutting down...')
kafka.jaas
KafkaClient {
org.apache.kafka.common.security.plain.PlainLoginModule required
username="user1"
password="sssshhhh"
serviceName="kafka";
};
shell command:
spark-submit \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 \
--files "kafka.jaas" \
--driver-java-options "-Djava.security.auth.login.config=kafka.jaas" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka.jaas" \
"./consumer.py"
It seems like Kafka cluster B is reachable, since I am able to get the offset information from it, but it's just not reading the messages.
The issue was caused by the worker nodes' connection to the Kafka cluster: the worker nodes' IP addresses weren't on the firewall whitelist of the Kafka cluster. The code above caused the worker nodes to time out and keep retrying to connect to the Kafka cluster until an interrupt signal was given.
As for the error message itself, no error message was surfaced on the master node while the worker nodes were still attempting to connect to the Kafka cluster, but every now and then a message was printed on the master console saying it failed to communicate with the worker node (or some message like 'gathering information').
NOTE: This is what I presume happened on the worker nodes (which I was unable to log on to, due to admin rights), but there may be a log stored on the worker nodes. (If someone can back this up or prove otherwise, it'll be much appreciated.)
As for the Kafka debug messages themselves, they appear to be printed to the screen by default when an error, info, or warning occurs, depending on the logger level setup; in some odd instances like this one, the log messages may not be directly visible on the screen.
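For reference, one common way to surface the Kafka client's own debug logging (a sketch, assuming Spark's default log4j 1.x setup; the file name is illustrative) is to raise the log level for the org.apache.kafka packages in a log4j.properties file and point the driver and executors at it, alongside the existing JAAS options:

# log4j-debug.properties (fragment): enable DEBUG for the Kafka consumer/producer clients
log4j.logger.org.apache.kafka=DEBUG

# pass it on spark-submit (combined with the existing -Djava.security.auth.login.config options)
spark-submit \
  --files "kafka.jaas,log4j-debug.properties" \
  --driver-java-options "-Dlog4j.configuration=file:log4j-debug.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j-debug.properties" \
  "./consumer.py"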

Dynamic resource allocation for spark applications not working

I am new to Spark and trying to figure out how dynamic resource allocation works. I have a Spark Structured Streaming application which tries to read a million records at a time from Kafka and process them. My application always starts with 3 executors and never increases the number of executors.
It takes 5-10 minutes to finish the processing. I thought it would increase the number of executors (up to 10) and try to finish the processing sooner, which is not happening. What am I missing here? How is this supposed to work?
I have set the below properties in Ambari for Spark:
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.initialExecutors = 3
spark.dynamicAllocation.maxExecutors = 10
spark.dynamicAllocation.minExecutors = 3
spark.shuffle.service.enabled = true
Below is what my submit command looks like:
/usr/hdp/3.0.1.0-187/spark2/bin/spark-submit --class com.sb.spark.sparkTest.sparkTest --master yarn --deploy-mode cluster --queue default sparkTest-assembly-0.1.jar
Spark code
//read stream
val dsrReadStream = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", brokers)           // kafka brokers
  .option("startingOffsets", startingOffsets)           // start point to read
  .option("maxOffsetsPerTrigger", maxoffsetpertrigger)  // no. of records per batch
  .option("failOnDataLoss", "true")
  .load()

/****
Logic to validate the format of log lines. Invalid log lines are written to Kafka; valid log lines are stored in 'dsresult'.
****/

//write stream
val dswWriteStream = dsresult.writeStream
  .outputMode(outputMode)                        // file write mode, default append
  .format(writeformat)                           // file format, default orc
  .option("path", outPath)                       // hdfs file write path
  .option("checkpointLocation", checkpointdir)   // checkpoint location
  .option("maxRecordsPerFile", 999999999)
  .trigger(Trigger.ProcessingTime(triggerTimeInMins))
  .start()
Just to clarify further,
spark.streaming.dynamicAllocation.enabled=true
works only for the DStreams API. See the Jira.
Also, if you set
spark.dynamicAllocation.enabled=true
and run a structured streaming job, the batch dynamic allocation algorithm kicks in, which may not be very optimal. See Jira
Dynamic Resource Allocation does not work with Spark Streaming
Refer to this link.
