Spark checkpointing has a lot of tmp.crc files

I am using Spark Structured Streaming, where I read a stream from Kafka and, after some transformations, write the resulting stream back to Kafka.
I see a lot of hidden ..*tmp.crc files in my checkpoint directory. These files are not being cleaned up and keep growing in number.
Am I missing some configuration?
I am not running Spark on Hadoop; I am using an EBS-based volume for checkpointing.
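For context, a minimal sketch of the kind of Kafka-to-Kafka query described above (the broker address, topic names, and checkpoint path are placeholders, not taken from the question). On a plain local or EBS-backed filesystem, the hidden .crc files are most likely the checksum companions that Hadoop's local filesystem implementation writes next to each checkpoint file.

import org.apache.spark.sql.SparkSession

// Minimal sketch of a Kafka -> transform -> Kafka streaming query.
// Broker, topics, and the checkpoint path below are hypothetical.
object KafkaToKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-kafka").getOrCreate()

    val source = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "input-topic")
      .load()

    // Real transformations would go here; this just passes the payload through.
    val transformed = source.selectExpr(
      "CAST(key AS STRING) AS key",
      "CAST(value AS STRING) AS value")

    // Spark writes offset/commit files under checkpointLocation; when that
    // location is a local (e.g. EBS-mounted) path, Hadoop's checksumming
    // local filesystem also writes the hidden .crc companions described above.
    val query = transformed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "output-topic")
      .option("checkpointLocation", "/mnt/ebs/checkpoints/kafka-to-kafka")
      .start()

    query.awaitTermination()
  }
}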

Related

Why does an Apache Spark Structured Streaming job get stuck writing to the checkpoint when using DBFS / S3

I have a Spark Structured Streaming job that reads data from Kafka, performs a simple transform, and saves it in Delta format. However, I found that the job gets stuck, without any error, when the executors perform the following task:
23/02/07 01:34:46 INFO StateStore: Retrieved reference to StateStoreCoordinator: org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef#4552b2e8
23/02/07 01:34:46 INFO HDFSBackedStateStoreProvider: Deleted files older than 6042 for HDFSStateStoreProvider[id = (op=0,part=59),dir = s3a://bucket/path/to/table]:
I found some docs that mention this issue:
https://learn.microsoft.com/en-us/azure/databricks/kb/streaming/streaming-job-stuck-writing-checkpoint
https://kb.databricks.com/en_US/streaming/streaming-job-stuck-writing-checkpoint
Both docs say that the solution should be:
Solution
You should use persistent storage for streaming checkpoints.
You should not use DBFS for streaming checkpoint storage.
I would like to know:
Why is DBFS not good for streaming checkpoint storage?
Isn't DBFS on S3 persistent?
Is it because DBFS on S3 has heavy network and I/O overhead compared to a local machine?
Thanks for your help!
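For illustration only (the bucket, topic, and paths below are invented, not from the question), this is roughly where the checkpoint location is configured in a Kafka -> Delta job like the one described; the linked docs are essentially about what kind of storage that path should point to.

import org.apache.spark.sql.SparkSession

// Sketch of a Kafka -> Delta streaming write; all names and paths are placeholders.
val spark = SparkSession.builder().appName("kafka-to-delta").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

events.writeStream
  .format("delta")
  // The docs' recommendation concerns this option: per the linked pages, it
  // should point at persistent storage rather than DBFS.
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/events")
  .start("s3a://my-bucket/tables/events")
  .awaitTermination()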

Zeppelin Spark Interpreter: disable _spark_metadata when reading data written to HDFS by Spark Structured Streaming

We have a stream implemented with Spark Structured Streaming writing to an HDFS folder and thus creating the _spark_metadata subfolder, in order to achieve the exactly-once guarantee when writing to a file system.
We additionally have a mode in which we re-generate the results of the stream for historical data in a separate folder. After re-processing has finished, we copy the re-generated subfolders under the "normal-mode" folder. You can imagine that the _spark_metadata of the "normal-mode" folder is no longer up to date, and this causes incorrect reads of this data in Zeppelin.
Is there a way to disable the use of the _spark_metadata folder when reading from an HDFS folder with Spark?
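The thread itself does not include a fix, but one workaround sometimes suggested elsewhere (an assumption on my part, not from this question) is to read the part files through a glob pattern, so that Spark's file source never looks for a _spark_metadata log at the folder root:

import org.apache.spark.sql.SparkSession

// Hypothetical workaround sketch; the output path and file pattern are placeholders.
val spark = SparkSession.builder().appName("read-without-sink-metadata").getOrCreate()

// Reading the folder root would consult /data/output/_spark_metadata:
// val df = spark.read.parquet("/data/output")

// Reading through a glob bypasses that check and simply lists the matching
// files, including the re-generated ones copied in afterwards.
val df = spark.read.parquet("/data/output/part-*")
df.show()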

What is the advantage of using Spark with HDFS as the file storage system and YARN as the resource manager?

I am trying to understand whether Spark is an alternative to the vanilla MapReduce approach for the analysis of big data. Since Spark keeps the operations on the data in memory, when using HDFS as the storage system for Spark, does it take advantage of the distributed storage of HDFS? For instance, suppose I have a 100 GB CSV file stored in HDFS and I want to do analysis on it. If I load it from HDFS into Spark, will Spark load the complete data in memory to do the transformations, or will it use the distributed environment that HDFS provides for storage, which is leveraged by the MapReduce programs written in Hadoop? If not, then what is the advantage of using Spark over HDFS?
PS: I know Spark spills to disk if there is RAM overflow, but does this spill occur per node of the cluster (say, 5 GB per node) or for the complete data (100 GB)?
Spark jobs can be configured to spill to local executor disk if there is not enough memory to read your files. Or you can enable HDFS snapshots and caching between Spark stages.
You mention CSV, which is just a bad format to have in Hadoop in general. If you have 100 GB of CSV, you could just as easily have less than half that if it were written in Parquet or ORC (see the conversion sketch after this answer).
At the end of the day, you need some processing engine and some storage layer. For example, Spark on Mesos or Kubernetes might work just as well as on YARN, but those are separate systems and are not bundled and tied together as nicely as HDFS and YARN. Plus, like MapReduce, when using YARN you move the execution to the NodeManagers on the DataNodes rather than pulling data over the network, which you would be doing with other Spark execution modes. The NameNode and ResourceManager coordinate this communication about where data is stored and processed.
If you are convinced that MapReduce v2 can be better than Spark, I would encourage looking at Tez instead.
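A quick sketch of the CSV-to-Parquet conversion mentioned in the answer above (the paths are placeholders):

import org.apache.spark.sql.SparkSession

// One-off conversion sketch: read CSV from HDFS, rewrite it as Parquet.
// Paths are hypothetical; inferSchema is convenient here but costs an extra pass over the data.
val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/raw/events.csv")
  .write
  .mode("overwrite")
  .parquet("hdfs:///data/parquet/events")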

Save each Kafka message in HDFS using Spark Streaming

I am using Spark Streaming to do analysis. After the analysis I have to save the Kafka messages in HDFS. Each Kafka message is an XML file. I can't use rdd.saveAsTextFile because it will save the whole RDD. Each element of the RDD is a Kafka message (an XML file). How do I save each RDD element (file) in HDFS using Spark?
I would go about this a different way. Stream your transformed data back into Kafka, and then use the HDFS connector for Kafka Connect to stream the data to HDFS. Kafka Connect is part of Apache Kafka. The HDFS connector is open source and available standalone or as part of Confluent Platform.
Doing it this way, you decouple your processing from writing your data to HDFS, which makes it easier to manage, troubleshoot, and scale.
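As a rough sketch of the first half of that suggestion (the broker address, topic name, and the DStream[String] of transformed XML are placeholders, not from the question), writing the transformed messages back to Kafka from a DStream job might look like this; the HDFS half is then handled by the Kafka Connect HDFS sink, which is configured in Kafka Connect rather than in Spark code:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: push each transformed XML message back to a Kafka topic.
def writeBackToKafka(transformed: DStream[String]): Unit = {
  transformed.foreachRDD { rdd =>
    rdd.foreachPartition { messages =>
      // One producer per partition per batch keeps the sketch simple;
      // a pooled or lazily created singleton producer would be more efficient.
      val props = new Properties()
      props.put("bootstrap.servers", "broker:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      messages.foreach { xml =>
        producer.send(new ProducerRecord[String, String]("transformed-xml", xml))
      }
      producer.close()
    }
  }
}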

Spark Streaming from multiple Kafka topics to multiple database tables with checkpoints

I'm building Spark Streaming from Apache Kafka to our columnar database.
To ensure fault tolerance I'm using HDFS checkpoints and a write-ahead log.
Apache Kafka topic -> Spark Streaming -> HDFS checkpoint -> Spark SQL (for message manipulation) -> Spark JDBC to our DB.
When I use a Spark job for one topic and one table, everything works fine.
I'm trying to stream multiple Kafka topics in one Spark job and write to multiple tables, and here the problem with the checkpoint started (the checkpoint is per topic/table).
The problem is with checkpoints :(
1) If I use "KafkaUtils.createDirectStream" with a list of topics and "groupBy" topic name, there is only one checkpoint folder, and if, for example, I need to increase resources during the ongoing streaming (change the number of cores due to Kafka lag), this is impossible, because today that is only possible if I delete the checkpoint folder and restart the Spark job.
2) Use multiple Spark StreamingContexts; I will try this today and see if it works.
3) Multiple Spark Streaming jobs with high-level consumers (offsets saved in Kafka 0.10...).
Any other ideas/solutions that I'm missing?
Does Structured Streaming with multiple Kafka topics and checkpoints behave differently?
Thx
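On that last question: a hedged sketch of how this could look in Structured Streaming, where each topic gets its own query and its own checkpointLocation inside one application (the topic names, JDBC URL, table names, and paths below are invented for illustration):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("multi-topic").getOrCreate()

// One streaming query per topic, each with its own checkpoint directory,
// so restarting or rescaling one pipeline does not disturb the others.
val topics = Seq("topic_a", "topic_b") // placeholder topic names

val queries = topics.map { topic =>
  // Per-topic micro-batch writer: Spark SQL manipulation + JDBC write would go here.
  val writeBatch: (DataFrame, Long) => Unit = (batch, _) =>
    batch.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/analytics") // hypothetical DB
      .option("dbtable", s"events_$topic")                        // hypothetical table
      .mode("append")
      .save()

  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", topic)
    .load()
    .selectExpr("CAST(value AS STRING) AS value")
    .writeStream
    .foreachBatch(writeBatch)
    .option("checkpointLocation", s"hdfs:///checkpoints/$topic") // one per query
    .start()
}

queries.foreach(_.awaitTermination())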
