I'm trying to run two separate PySpark streaming jobs as two steps in an AWS EMR cluster.
When I start the first streaming job it runs normally and its status stays RUNNING.
When I try to run the second job it gets stuck in the ACCEPTED status,
as shown in the image below:
My simplified PySpark streaming code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

df = spark.readStream \
    .format("csv") \
    .option("delimiter", "|") \
    .option("header", True) \
    .option("multiLine", True) \
    .option("ignoreLeadingWhiteSpace", True) \
    .option("ignoreTrailingWhiteSpace", True) \
    .option("escape", "\"") \
    .load("s3a://input_bucket/path/") \
    .withColumn("file_path", input_file_name())
def for_each_batch(df, batchId):
    df.write.format("delta").mode("append").save("s3a://output_bucket/path/")

query_changes = df.writeStream \
    .foreachBatch(for_each_batch) \
    .option("checkpointLocation", "./checkpoint") \
    .start()

query_changes.awaitTermination()
On the Steps page of the EMR cluster, both jobs show a RUNNING status.
I suspect that the awaitTermination() call is blocking the execution.
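For context on that suspicion: awaitTermination() only blocks the driver thread of its own application, so by itself it should not prevent YARN from scheduling a second application. If the intent were instead to drive both streams from a single step, a minimal sketch could look like the following (the paths are placeholders, and spark.sql.streaming.schemaInference is enabled only because this sketch omits an explicit schema):

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()
# The streaming file source needs a schema; infer it here only to keep the sketch short
spark.conf.set("spark.sql.streaming.schemaInference", "true")

def start_stream(in_path, out_path, checkpoint):
    # Same pattern as above: CSV file source -> Delta sink via foreachBatch
    src = (spark.readStream.format("csv")
           .option("delimiter", "|")
           .option("header", True)
           .load(in_path)
           .withColumn("file_path", input_file_name()))
    return (src.writeStream
            .foreachBatch(lambda batch_df, batch_id:
                          batch_df.write.format("delta").mode("append").save(out_path))
            .option("checkpointLocation", checkpoint)
            .start())

# Hypothetical input/output locations for the two streams
start_stream("s3a://input_bucket/path_a/", "s3a://output_bucket/path_a/", "s3a://checkpoints/a/")
start_stream("s3a://input_bucket/path_b/", "s3a://output_bucket/path_b/", "s3a://checkpoints/b/")

# Block until any of the started queries stops, instead of blocking on a single one
spark.streams.awaitAnyTermination()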
Related
I am trying to use Structured Streaming in Databricks with a socket as the source and the console as the output sink.
However, I am not able to see any output in Databricks.
from pyspark.sql.functions import *

lines = (spark
    .readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load())

countdf = lines.select(split(col("value"), "\\s").alias("word")).groupBy("word").count()

checkpointDir = "/tmp/streaming"

streamingQuery = (countdf
    .writeStream
    .format("console")
    .outputMode("complete")
    .trigger(processingTime="1 second")
    .option("checkpointLocation", checkpointDir)
    .start())
In another terminal, I send data via the socket (for example with the small sender sketched below).
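A minimal Python sketch of such a sender, assuming nothing else is already listening on port 9999 (the Spark documentation typically uses nc -lk 9999 for the same purpose):

import socket
import time

# Listen on the port that Spark's socket source connects to
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("localhost", 9999))
server.listen(1)

conn, _ = server.accept()  # the socket source connects here
for line in ["hello world", "hello spark"]:
    conn.sendall((line + "\n").encode())
    time.sleep(1)
conn.close()
server.close()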
I am not able to see any updates/changes in the dashboard, and there is no output shown. When I try to show countdf, it throws AnalysisException: Queries with streaming sources must be executed with writeStream.start();
You can't use .show() on streaming DataFrames. Also, in the case of the console sink, the output is printed to the driver logs, not into the notebook. If you just want to see the results of your transformations, on Databricks you can use the display function, which supports visualization of streaming datasets, including settings for the checkpoint location and trigger interval.
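A minimal sketch of that approach, assuming the streaming-related parameters of Databricks' display helper (checkpointLocation and processingTime); the checkpoint path here is a placeholder:

# Renders the streaming aggregation directly in the notebook
display(countdf, checkpointLocation="/tmp/streaming_display", processingTime="1 second")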
We have a Spark Structured Streaming job written in Scala running in production which reads from a Kafka topic and writes to an HDFS sink. The trigger time is 60 seconds. The job was deployed 4 months ago and, after running well for a month, we started getting the below error, and the job fails instantly:
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/XYZ/hive/prod/landing_dir/abc/_spark_metadata/.edfac8fb-aa6d-4c0e-aa19-48f44cc29774.tmp (inode 6736258272) Holder DFSClient_NONMAPREDUCE_841796220_49 does not have any open files
Earlier this issue was not regular, i.e. it was happening once every 2-3 weeks. In the last month, this error has become very frequent, happening at an interval of 3-4 days and failing the job. We restart this job once a week as part of regular maintenance. The Spark version is 2.3.2 and we run on the YARN cluster manager. From the error it is evident that something is not going right within the Write Ahead Log (WAL) directory, since the path points to _spark_metadata. We would like to understand what is causing this exception and how we can handle it. Is this something we can handle in our application, or is it an environment issue that needs to be addressed at the infra level?
Below is the code snippet:
val spark = SparkSession
  .builder
  .master(StreamerConfig.sparkMaster)
  .appName(StreamerConfig.sparkAppName)
  .getOrCreate()

spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.streaming.stopGracefullyOnShutdown", "true")
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.shuffle.service.enabled", "true")

val readData = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", StreamerConfig.kafkaBootstrapServer)
  .option("subscribe", StreamerConfig.topicName)
  .option("failOnDataLoss", false)
  .option("startingOffsets", StreamerConfig.kafkaStartingOffset)
  .option("maxOffsetsPerTrigger", StreamerConfig.maxOffsetsPerTrigger)
  .load()

val deserializedRecords = StreamerUtils.deserializeAndMapData(readData, spark)

val streamingQuery = deserializedRecords.writeStream
  .queryName(s"Persist data to hive table for ${StreamerConfig.topicName}")
  .outputMode("append")
  .format("orc")
  .option("path", StreamerConfig.hdfsLandingPath)
  .option("checkpointLocation", StreamerConfig.checkpointLocation)
  .partitionBy("date", "hour")
  .option("truncate", "false")
  .trigger(Trigger.ProcessingTime(StreamerConfig.triggerTime))
  .start()
I have a Structured Streaming job with Trigger.Once() enabled which I run every 20 minutes. After each run I want to remove my processed parquet files from S3, so I enabled the cleanSource delete option, but it does not work and I don't know why!
Before showing my code, I have to comment on it: I'm running multiple structured streaming queries in parallel; I have 5 buckets and I submit these in parallel. The job works perfectly, but it does not delete any processed files.
val tables = Seq("table1", "table2", "table3", "table4", "table5")

tables.par.map(table => {
  new ReplicationTables().run(table)
})

class ReplicationTables {
  // `table` is simplified here; in the real job it carries the schema and table name
  // referenced below as table.schema and table.name
  def run(table: Table): Unit = {
    val dataFrame = spark.readStream
      .option("mergeSchema", "true")
      .schema(dfSchema)
      .option("cleanSource", "delete")
      .parquet(s"s3a://my-bucket/${table}/*")

    // I do some transformations and afterwards write the resulting dataframe, called df, to S3 in Delta format
    df.writeStream
      .format("delta")
      .outputMode("append")
      .queryName(s"Delta/${table.schema}/${table.name}")
      .trigger(Trigger.Once())
      .option("checkpointLocation", s"s3a://my-bucket/checkpoints/${table.schema}/${table.name}")
      .start(s"s3a://my-bucket/Delta_Tables/${table}/")
      .awaitTermination()
  }
}
PS: Even with the INFO log level I do not get any logs about cleanSource.
PS 2: I am following the Structured Streaming docs on cleanSource: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
Try increasing spark.sql.streaming.fileSource.cleaner.numThreads (for example to "10") to speed up the cleanup. If more files are being generated than the cleaner can process in time, Spark does not delete them; increasing the number of cleaner threads may help.
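A minimal PySpark sketch of how the cleaner settings fit together, assuming placeholder paths and schema; numThreads is set here as a session-level SQL configuration:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
# More cleaner threads so deleting processed source files can keep up (default is 1)
spark.conf.set("spark.sql.streaming.fileSource.cleaner.numThreads", "10")

# Hypothetical schema for the source files
df_schema = StructType([StructField("id", StringType()), StructField("value", StringType())])

df = (spark.readStream
      .schema(df_schema)
      .option("cleanSource", "delete")   # delete each source file after its batch is committed
      .parquet("s3a://my-bucket/table1/*"))

(df.writeStream
   .format("delta")
   .trigger(once=True)
   .option("checkpointLocation", "s3a://my-bucket/checkpoints/table1")
   .start("s3a://my-bucket/Delta_Tables/table1/")
   .awaitTermination())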
I am new to Kafka and Spark Structured Streaming. I want to know how Spark in batch mode knows which offsets to read from. If I specify "startingOffsets" as "earliest", I only get the latest records and not all the records in the partition. I ran the same code in 2 different clusters. Cluster A (local machine) fetched 6 records; Cluster B (TST cluster, very first run) fetched 1 record.
df = spark \
    .read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", broker) \
    .option("subscribe", topic) \
    .option("startingOffsets", "earliest") \
    .option("endingOffsets", "latest") \
    .load()
I am planning to run my batch once a day. Will I get all the records from yesterday's run up to the current run? Where do I see offsets and commits for batch queries?
According to the Structured Streaming + Kafka Integration Guide, your offsets are stored in the checkpoint location that you provide in the write part of your batch job.
If you do not delete the checkpoint files, the job will continue to read from Kafka where it left off. If you delete the checkpoint files, or if you run the job for the very first time, the job will consume messages based on the startingOffsets option.
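A minimal PySpark sketch of the pattern this describes, assuming the daily run is implemented as a streaming query with a run-once trigger (the broker, topic, and paths are placeholders); each run records its offsets under the checkpoint location, so the next run resumes where the previous one stopped:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my_topic")
      .option("startingOffsets", "earliest")   # only applies to the very first run
      .load())

(df.writeStream
   .format("parquet")
   .option("path", "/data/kafka_dump")
   .option("checkpointLocation", "/checkpoints/kafka_dump")  # offsets are tracked here
   .trigger(once=True)                                       # process what is available, then stop
   .start()
   .awaitTermination())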
My goal is to read streaming data from a stream (in my case AWS Kinesis) and then query the data. The problem is that I want to query the last 5 minutes of data on every batch interval. What I found is that it is possible to keep the data in a stream for a certain period (using the StreamingContext.remember(Duration duration) method). Zeppelin's Spark interpreter creates the SparkSession automatically and I don't know how to configure the StreamingContext. Here's what I do:
val df = spark
  .readStream
  .format("kinesis")
  .option("streams", "test")
  .option("endpointUrl", "kinesis.us-west-2.amazonaws.com")
  .option("initialPositionInStream", "latest")
  .option("format", "csv")
  .schema(//schema definition)
  .load
So far so good. Then, as far as I can see, the streaming context is started when the write stream is set up and started:
df.writeStream
  .format(//output source)
  .outputMode("complete")
  .start()
But having only the SparkSession, I don't know how to achieve a query over the last X minutes of data. Any suggestions?
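One way to sketch this without a StreamingContext is to make the stream's output queryable and filter on an event-time column. A minimal PySpark sketch under those assumptions (the column name event_time is hypothetical, and the memory sink keeps all rows in driver memory, so this suits notebook exploration rather than production):

# df: the streaming DataFrame read from Kinesis, as above
# Write the stream into an in-memory table that can be queried from the notebook
(df.writeStream
   .format("memory")
   .queryName("recent_events")
   .outputMode("append")
   .start())

# On every run, query only the rows from the last 5 minutes
last_5_min = spark.sql("""
    SELECT *
    FROM recent_events
    WHERE event_time >= current_timestamp() - INTERVAL 5 MINUTES
""")
last_5_min.show()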