Writing a large DataFrame from PySpark to Kafka runs into timeouts - Azure

I'm trying to write a DataFrame with about 230 million records to Kafka, more specifically to a Kafka-enabled Azure Event Hub, but I'm not sure whether that's actually the source of my issue.
EH_SASL = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=****";'
dfKafka \
    .write \
    .format("kafka") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
    .option("topic", "mytopic") \
    .option("checkpointLocation", "/mnt/telemetry/cp.txt") \
    .save()
This starts up fine and writes about 3-4 million records successfully (and pretty fast) to the queue. But then the job stops after a couple of minutes with messages like these:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 7.0 failed 4 times, most recent failure: Lost task 6.3 in stage 7.0 (TID 248, 10.139.64.5, executor 1): kafkashaded.org.apache.kafka.common.errors.TimeoutException: Expiring 61 record(s) for mytopic-18: 32839 ms has passed since last append
or
org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 8.0 failed 4 times, most recent failure: Lost task 13.3 in stage 8.0 (TID 348, 10.139.64.5, executor 1): kafkashaded.org.apache.kafka.common.errors.TimeoutException: The request timed out.
Also, I never see the checkpoint file being created or written to.
I also played around with .option("kafka.delivery.timeout.ms", 30000) and different values, but that didn't seem to have any effect.
I'm running this on an Azure Databricks cluster, runtime version 5.0 (includes Apache Spark 2.4.0, Scala 2.11).
I don't see any throttling or other errors on my Event Hub, so that should be OK.
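As an aside, checkpointLocation is a Structured Streaming option; on a plain batch df.write to Kafka it is not used, which would be consistent with never seeing the checkpoint file. Below is a hedged sketch of the streaming equivalent, in case that was the intent; it assumes dfKafka comes from readStream and already has a value column.
# Minimal sketch, not the original author's code: a writeStream version in which
# checkpointLocation is actually honoured. Assumes dfKafka is a streaming DataFrame.
query = dfKafka.writeStream \
    .format("kafka") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
    .option("topic", "mytopic") \
    .option("checkpointLocation", "/mnt/telemetry/cp.txt") \
    .start()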

Finally figured it out (mostly):
Turns out the default batch.size (16384, measured in bytes) was too large for the endpoint. After I set batch.size to 5000, it worked and wrote about 700k messages per minute to the Event Hub. Also, the timeout option above was the wrong name and was simply being ignored; the correct one is kafka.request.timeout.ms.
The only remaining issue is that it still randomly runs into timeouts and apparently starts over from the beginning, so I end up with duplicates. Will open another question for that.
dfKafka \
    .write \
    .format("kafka") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.batch.size", 5000) \
    .option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
    .option("kafka.request.timeout.ms", 120000) \
    .option("topic", "raw") \
    .option("checkpointLocation", "/mnt/telemetry/cp.txt") \
    .save()

Related

I am facing a "java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate" error while working with PySpark

I am using this tech stack:
Spark version: 3.3.1
Scala Version: 2.12.15
Hadoop Version: 3.3.4
Kafka Version: 3.3.1
I am trying to get data from a Kafka topic through Spark Structured Streaming, but I am facing the error mentioned above. The code I am using is:
For reading data from the Kafka topic:
result_1 = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "sampleTopic1") \
    .option("startingOffsets", "latest") \
    .load()
For writing data to the console:
trans_detail_write_stream = result_1 \
    .writeStream \
    .trigger(processingTime='1 seconds') \
    .outputMode("update") \
    .option("truncate", "false") \
    .format("console") \
    .start() \
    .awaitTermination()
For execution I am using the following command:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1 streamer.py
I am facing this error "java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)"
and later in the logs it gives me this exception too:
"StreamingQueryException: Query [id = 600dfe3b-6782-4e67-b4d6-97343d02d2c0, runId = 197e4a8b-699f-4852-a2e6-1c90994d2c3f] terminated with exception: Writing job aborted"
Please suggest
Edit: screenshot showing the Spark version (not reproduced here).
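Since that screenshot is not reproduced here, a quick check inside the same PySpark session can stand in for it; the snippet below is a hedged sketch. A NoSuchMethodError on KafkaTokenUtil usually means the spark-sql-kafka jar on the classpath does not match the Spark runtime, so the runtime version should line up with the --packages coordinate (spark-sql-kafka-0-10_2.12:3.3.1).
# Hedged sanity check (assumes an active SparkSession named `spark`):
# the runtime version printed here should match the connector version in --packages.
print(spark.version)  # expected: 3.3.1
It is also worth confirming with spark-submit --version that the launcher uses the same Spark build, and that no older spark-sql-kafka or kafka-clients jar sits in the cluster's jars directory, since a stale copy there would shadow the one pulled in by --packages.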

java.io.FileNotFoundException in Spark structured streaming job

I am trying to run a Spark Structured Streaming job that reads CSV files from the local filesystem and loads them to HDFS in Parquet format.
I start a pyspark job as follows:
pyspark2 --master yarn --executor-memory 8G --driver-memory 8G
The code looks as follows:
from pyspark.sql.types import StructType
sch = StructType(...)
spark.readStream \
    .format("csv") \
    .schema(sch) \
    .option("header", True) \
    .option("delimiter", ',') \
    .load("<Load_path>") \
    .writeStream \
    .format("parquet") \
    .outputMode("append") \
    .trigger(processingTime='10 seconds') \
    .option("path", "<Hdfs_path>") \
    .option("checkpointLocation", "<Checkpoint Loc>") \
    .start()
The load path is like file:////home/pardeep/file2, where file2 is a directory name (not a file).
It runs fine at the start, but after adding more CSV files to the source folder it gives the error below:
Caused by: java.io.FileNotFoundException: File file:<file>.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
The error does not always come after adding the first file; sometimes it is after the first file, sometimes after the second.
There is another job that moves files into this folder: it writes to a temp folder and then moves the files into this one.
At the start there are some files present in the directory, and files keep arriving continuously (every 2-3 minutes). I am not sure how to do the refresh (what is the table name?) or when to do it, because this is a streaming job.
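For context, Spark's file stream source expects new files to appear atomically in the monitored directory, so the usual recommendation is to finish writing elsewhere and then rename into the watched folder. The snippet below is a minimal sketch of that pattern, assuming the mover job is a Python script and that the staging and watched directories (hypothetical paths) sit on the same filesystem.
import os

# Hypothetical paths for illustration only.
TMP_DIR = "/home/pardeep/tmp_staging"
WATCH_DIR = "/home/pardeep/file2"

def publish(filename):
    # os.rename is atomic when source and destination are on the same filesystem,
    # so the streaming job never lists a half-written CSV.
    os.rename(os.path.join(TMP_DIR, filename), os.path.join(WATCH_DIR, filename))
If the mover already works this way, the error may instead come from files being deleted or renamed after Spark has listed them but before it reads them.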

Error loading spark sql context for redshift jdbc url in glue

Hello, I am trying to fetch month-wise data from a bunch of heavy Redshift tables in a Glue job.
As far as I know, the Glue documentation on this is very limited.
The query works fine in SQL Workbench, which I have connected using the same JDBC URL ('myjdbc_url') that is being used in Glue.
Below is what I have tried, along with the error I am seeing:
from pyspark.context import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sql_context = SQLContext(sc)

df1 = sql_context.read \
    .format("jdbc") \
    .option("url", myjdbc_url) \
    .option("query", mnth_query) \
    .option("forward_spark_s3_credentials", "true") \
    .option("tempdir", "s3://my-bucket/sprk") \
    .load()
print("Total recs for month :"+str(mnthval)+" df1 -> "+str(df1.count()))
However, it shows me a driver error in the logs, as below:
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
at scala.Option.getOrElse(Option.scala:121)
I have used the following too, but to no avail; it ends up in a 'Connection refused' error.
sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", myjdbc_url) \
    .option("query", mnth_query) \
    .option("forward_spark_s3_credentials", "true") \
    .option("tempdir", "s3://my-bucket/sprk") \
    .load()
What is the correct driver to use? I am using Glue, which is a managed service with a transient cluster in the background, so I am not sure what I am missing.
Please help: what is the right driver?
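For reference, the plain jdbc source only uses a driver that java.sql.DriverManager can resolve, and "No suitable driver" usually means the driver class was never named or loaded. The snippet below is a hedged sketch, assuming the Amazon Redshift JDBC 4.2 driver jar is attached to the Glue job; the exact class name depends on the driver jar used.
# Hedged sketch: name the driver class explicitly so DriverManager can resolve it.
# "com.amazon.redshift.jdbc42.Driver" assumes the Redshift JDBC 4.2 driver jar is
# on the job's classpath; adjust to match whichever driver jar is attached.
df1 = sql_context.read \
    .format("jdbc") \
    .option("url", myjdbc_url) \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("query", mnth_query) \
    .load()
Note that forward_spark_s3_credentials and tempdir belong to the spark-redshift connector, not to the plain jdbc source, so they are dropped in this sketch.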

Spark cannot load the CSV file

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("spark://ip:7077") \
    .appName("usres mobile location information analysis") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.executor.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .config("spark.executor.extraClassPath", "/opt/anaconda3/jars/ojdbc6.jar") \
    .config("spark.executor.pyspark.memory", "2g") \
    .config("spark.driver.maxResultSize", "2g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.driver.extraClassPath", "/opt/anaconda3/jars/ojdbc6.jar") \
    .enableHiveSupport() \
    .getOrCreate()
I am trying to read a CSV file located on my local PC in the report folder, but it is not located on the master node. Is there any problem with my code? I use the following code to read the CSV file.
info_df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("mode", "PERMISSIVE") \
    .load("report/info.csv")
And I get the following error, which shows that Spark can't find the file. What is the probable solution?
Py4JJavaError: An error occurred while calling o580.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 31, ip , executor 4): java.io.FileNotFoundException: File file:/C:/Users/taimur.islam/Desktop/banglalink/Data Science/High Value Prediction/report/info.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
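For context, with a remote master the executors resolve file:/ paths on their own machines, so a relative path that only exists on the driver's PC fails exactly like this. One workaround is sketched below, assuming the CSV is small enough to fit in driver memory and that pandas is installed: read it on the driver and hand the rows to Spark so the executors never need the local report folder.
import pandas as pd

# Sketch only: load the CSV on the driver, then convert it to a Spark DataFrame.
local_pdf = pd.read_csv("report/info.csv")
info_df = spark.createDataFrame(local_pdf)
Alternatively, placing the file on a path every node can reach (HDFS, a network mount, or a copy on all workers) lets the original spark.read.csv call work unchanged.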

Spark Structured Stream Executors weird behavior

Using Spark Structured Streaming, with a Cloudera solution.
I'm using 3 executors, but when I launch the application only one executor is actually used.
How can I use multiple executors?
Let me give you more info.
These are my parameters:
Launch command:
spark2-submit --master yarn \
--deploy-mode cluster \
--conf spark.ui.port=4042 \
--conf spark.eventLog.enabled=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.streaming.backpressure.enabled=true \
--conf spark.streaming.kafka.consumer.poll.ms=512 \
--num-executors 3 \
--executor-cores 3 \
--executor-memory 2g \
--jars /data/test/spark-avro_2.11-3.2.0.jar,/data/test/spark-streaming-kafka-0-10_2.11-2.1.0.cloudera1.jar,/data/test/spark-sql-kafka-0-10_2.11-2.1.0.cloudera1.jar \
--class com.test.Hello /data/test/Hello.jar
The Code:
val lines = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", <topic_list:9092>)
  .option("subscribe", <topic_name>)
  .option("group.id", <consumer_group_id>)
  .load()
  .select($"value".as[Array[Byte]], $"timestamp")
  .map((c) => { .... })

val query = lines
  .writeStream
  .format("csv")
  .option("path", <outputPath>)
  .option("checkpointLocation", <checkpointLocationPath>)
  .start()

query.awaitTermination()
Result in SparkUI:
SparkUI Image
What I expected was that all executors would be working.
Any suggestions?
Thank you
Paolo
Looks like there is nothing wrong with your configuration; it's just that the topic you are reading might have only one partition. You need to increase the number of partitions on your Kafka topic. Usually, the partition count is around 3-4 times the number of executors.
If you don't want to touch the producer side, you can work around this by calling repartition(3) before you apply the map method, so every executor works on its own logical partition.
If you still want to explicitly control the work each executor gets, you could use the mapPartitions method.
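To make the repartition suggestion concrete, here is a hedged sketch in PySpark (the question's code is Scala, but the idea is the same; broker and topic names are placeholders): repartition right after load() so the downstream transformation is spread across the executors.
# Sketch only: placeholder broker/topic names.
lines = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker1:9092") \
    .option("subscribe", "topic_name") \
    .load() \
    .repartition(3) \
    .selectExpr("CAST(value AS STRING) AS value", "timestamp")
With a single-partition topic this costs a shuffle, so raising the topic's partition count remains the cleaner fix.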
