Apache Spark configuration for multipart checkpoint files

Is it possible to configure Delta Lake Checkpoint Creation on Apache Spark in order to achieve parallelization of the Checkpoint Parquet File?
I have attempted to adjust configuration options such as spark.databricks.delta.checkpoint.partSize and spark.databricks.delta.snapshotPartitions, however I have been unsuccessful in achieving parallelization. It has been noted that this issue has been recently resolved (https://github.com/delta-io/delta/issues/837).
spark.databricks.delta.checkpoint.partSize = n is the limit at which we will start parallelizing the checkpoint; we will attempt to write a maximum of this many actions per checkpoint file.
spark.databricks.delta.snapshotPartitions is the number of partitions to use for state reconstruction.
Would you be able to offer me some guidance on how to set the values of these options? I found the documentation to be somewhat limited.
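For reference, this is roughly how I have been setting them when building the session (the values below are arbitrary placeholders, not recommendations):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-checkpoint-tuning")
    # Threshold (in number of actions) above which checkpoint writing
    # should be split into multiple part files, per the linked issue.
    .config("spark.databricks.delta.checkpoint.partSize", "1000000")
    # Number of partitions used when reconstructing the table state.
    .config("spark.databricks.delta.snapshotPartitions", "50")
    .getOrCreate()
)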

Related

Best deduplication strategy to be used with spark

What is the best de-duplication strategy to be used with spark?
I have a Kafka source that is continuously fed with structured information (say JSON) from various producers.
I have an HDInsight Spark cluster that can pick up messages from this Kafka source in real time, process them, and push them to a destination Kafka topic in real time.
My use case demands that the information received from the source may contain duplicates, which need to be eliminated. The duplicates have to be checked against, say, the last 24 hours.
My attempt :
I tried using the dropDuplicates method in Spark along with watermarking (sketched below), but I think it's not the best thing to do, since the data for a single day's window may exceed 50 GB in my use case.
I also looked for a Bloom filter implementation which could be used with Spark but couldn't find a good one.
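Roughly what I tried looks like this (the Kafka details and the schema are placeholders for my actual setup):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

# Placeholder schema; the real payload comes from our producers.
schema = StructType([
    StructField("eventId", StringType()),
    StructField("eventTime", TimestampType()),
    StructField("payload", StringType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "source-topic")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

deduped = (
    events
    .withWatermark("eventTime", "24 hours")     # bound the dedup state to ~1 day
    .dropDuplicates(["eventId", "eventTime"])   # drop repeats within that window
)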
My question:
What are the possible approaches to eliminating duplicates in general for a large-scale Spark streaming application?
Which of these approaches can be used along with HDInsight clusters on Azure?
What are the fault-tolerance capabilities of such services?

S3 based streaming solution using apache spark or flink

We have batch pipelines writing files (mostly csv) into an s3 bucket. Some of these pipelines write per minute and some of them every 5 mins. Currently, we have a batch application which runs every hour processing these files.
Business wants data to be available every 5 mins. Instead of running batch jobs every 5 mins, we decided to use Apache Spark Structured Streaming and process the data in real time. My question is: how easy/difficult is it to productionise this solution?
My only worry is that if the checkpoint location gets corrupted, deleting the checkpoint directory will reprocess data from up to a year back. Has anyone productionised a solution on S3 using Spark Structured Streaming, or do you think Flink is better for this use case?
If you think there is a better architecture/pattern for this problem, kindly point me in the right direction.
ps: We already thought of putting these files into Kafka and ruled it out due to bandwidth constraints and the large size of the files.
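For reference, the Structured Streaming setup we have in mind reads the S3 files directly as a file source, roughly like this (bucket paths and the schema are placeholders):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("s3-csv-streaming").getOrCreate()

# Placeholder schema; file sources require an explicit schema.
schema = StructType([
    StructField("col1", StringType()),
    StructField("col2", StringType()),
])

stream = (
    spark.readStream
    .schema(schema)
    .option("maxFilesPerTrigger", 100)   # cap how many new files each micro-batch picks up
    .csv("s3://input-bucket/incoming/")
)

query = (
    stream.writeStream
    .format("parquet")
    .option("path", "s3://output-bucket/processed/")
    .option("checkpointLocation", "s3://output-bucket/checkpoints/ingest/")
    .trigger(processingTime="5 minutes")
    .start()
)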
I found a way to do this, though not the most effective way. Since we have already productionized a Kafka-based solution before, we could push an event into Kafka using S3 event notifications and Lambda. The event will contain only metadata such as the file location and size.
This will make the Spark program a bit more challenging, as the file will be read and processed inside the executor, which effectively does not utilise distributed processing. Alternatively, read the file in the executor and bring the data back to the driver to utilise Spark's distributed processing. Either way, this will require the Spark app to be planned a lot better in terms of memory, because input file sizes vary a lot.
https://databricks.com/blog/2019/05/10/how-tilting-point-does-streaming-ingestion-into-delta-lake.html
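A rough sketch of what the Kafka-metadata approach could look like with foreachBatch (the topic name, bucket paths, and event layout are assumptions on my part):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("s3-file-events").getOrCreate()

# Kafka topic carrying events like {"path": "s3://bucket/key.csv", "size": 12345}
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "file-events")
    .load()
    .select(F.get_json_object(F.col("value").cast("string"), "$.path").alias("path"))
)

def process_batch(batch_df, batch_id):
    # Collect only the small metadata (paths), then read the actual files.
    paths = [r.path for r in batch_df.select("path").collect()]
    if paths:
        data = spark.read.csv(paths, header=True)
        data.write.mode("append").parquet("s3://output-bucket/processed/")

(events.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "s3://output-bucket/checkpoints/file-events/")
    .start())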

PySpark: How to speed up sqlContext.read.json?

I am using the PySpark code below to read thousands of JSON files from an S3 bucket:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)
sqlContext.read.json("s3://bucknet_name/*/*/*.json")
This takes a lot of time (~16 mins) to read and parse the JSON files. How can I parallelize or speed up the process?
The short answer is: it depends (on the underlying infrastructure) and on the distribution within the data (the skew, which only matters when you're performing anything that causes a shuffle).
If the code you posted is being run on, say, AWS EMR or MapR, it's best to optimize the number of executors on each cluster node such that the number of cores per executor is between three and five. This number is important from the standpoint of reading from and writing to S3.
Another possible reason behind the slowness can be the dreaded corporate proxy. If all your requests to the S3 service are being routed via a corporate proxy, then the latter is going to be a huge bottleneck. It's best to bypass the proxy for requests to the S3 service via the NO_PROXY JVM argument on the EMR cluster.
This talk from Cloudera alongside their excellent blogs one and two is an excellent introduction to tuning the cluster. Since we're using sqlContext.read.json, the underlying DataFrame will be split into a number of partitions given by the Spark SQL setting spark.sql.shuffle.partitions (described here). It's best to set it to 2 * Number of Executors * Cores per Executor. That will definitely speed up reading on a cluster whose calculated value exceeds 200.
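For example, with 25 executors of 5 cores each (numbers purely illustrative), that would be:
# Set shuffle partitions to 2 * executors * cores per executor
sqlContext.setConf("spark.sql.shuffle.partitions", str(2 * 25 * 5))  # 250 partitions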
Also, as mentioned in the other answer, if you know the schema of the JSON, providing it explicitly may speed things up by avoiding schema inference.
I would also implore you to look at the Spark UI and dig into the DAG for slow jobs. It's an invaluable tool for performance tuning on Spark.
I am planning to consolidate as many of the infrastructure optimizations for AWS EMR as I can into a blog post. I will update the answer with the link once it's done.
There are at least two ways to speed up this process:
Avoid wildcards in the path if you can. If it is possible, provide a full list of paths to be loaded instead.
Provide the schema argument to avoid schema inference.
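A minimal sketch combining both suggestions (the schema fields and file paths are hypothetical):
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical schema; replace with the actual structure of your JSON.
schema = StructType([
    StructField("id", StringType()),
    StructField("timestamp", LongType()),
    StructField("payload", StringType()),
])

# Explicit list of paths instead of wildcards.
paths = [
    "s3://bucknet_name/2019/01/file1.json",
    "s3://bucknet_name/2019/01/file2.json",
]

df = sqlContext.read.json(paths, schema=schema)  # no schema inference pass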

Support for Parquet as an input / output format when working with S3

I've seen a number of questions describing problems when working with S3 in Spark:
Spark jobs finishes but application takes time to close
spark-1.4.1 saveAsTextFile to S3 is very slow on emr-4.0.0
Writing Spark checkpoints to S3 is too slow
many specifically describing issues with Parquet files:
Slow or incomplete saveAsParquetFile from EMR Spark to S3
Does Spark support Partition Pruning with Parquet Files
is Parquet predicate pushdown works on S3 using Spark non EMR?
Huge delays translating the DAG to tasks
Fast Parquet row count in Spark
as well as some external sources referring to other issues with Spark - S3 - Parquet combinations. It makes me think that either S3 with Spark or this complete combination may not be the best choice.
Am I onto something here? Can anyone provide an authoritative answer explaining:
Current state of the Parquet support with focus on S3.
Can Spark (SQL) fully take advantage of Parquet features like partition pruning, predicate pushdown (including deeply nested schemas), and Parquet metadata? Do all of these features work as expected on S3 (or compatible storage solutions)?
Ongoing developments and opened JIRA tickets.
Are there any configuration options one should be aware of when using these three together?
A lot of the issues aren't Parquet specific; rather, S3 is not a filesystem, despite the APIs trying to make it look like one. Many nominally low-cost operations take multiple HTTPS requests, with the consequent delays.
Regarding JIRAs
HADOOP-11694: S3A phase II, everything you will get in Hadoop 2.8. Much of this is already in HDP 2.5, and yes, it has significant benefits.
HADOOP-13204: the todo list to follow.
Regarding spark (and hive), the use of rename() to commit work is a killer. It's used at the end of tasks and jobs, and in checkpointing. The more output you generate, the longer things take to complete. The s3guard work will include a zero-rename committer, but it will take care and time to move things to it.
Parquet? pushdown works, but there are a few other options to speed things up. I list them and others in:
http://www.slideshare.net/steve_l/apache-spark-and-object-stores
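As a starting point, a few settings that commonly come up for this combination (my own summary, not taken from the slides; check the docs for your Spark/Hadoop versions):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Parquet side: keep filter pushdown on, skip schema merging on read
    .config("spark.sql.parquet.filterPushdown", "true")
    .config("spark.sql.parquet.mergeSchema", "false")
    # Output committing: the v2 file output committer does fewer renames
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    # Speculative tasks plus rename-based commits on S3 is a risky mix
    .config("spark.speculation", "false")
    .getOrCreate()
)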

How to enable lineage-based fault tolerance for Spark-Tachyon integration?

I am trying to implement RDD/DataFrame sharing using Tachyon. It is my understanding that with HDFS as the under-filesystem, writes are asynchronous (with replication to HDFS happening behind the scenes) and therefore should be faster, but in my testing I see that Tachyon with HDFS underFS is 2-6 times slower at writing.
From this Tachyon paper I see that:
"We made [lineage-based fault tolerance] configurable in our Spark and MapReduce integration"
How do you enable Spark to use lineage-based fault tolerance in Tachyon?
Note: I am using the Spark Dataframe method, df.write.parquet, and the RDD method, rdd.saveAsObjectFile, to save my Dataframes/RDDs to Tachyon.
You should set tachyon.user.lineage.enabled to true and adjust other lineage settings according to your preferences. Some of the most interesting settings (from the Master Configuration docs):
tachyon.master.lineage.checkpoint.interval.ms - The interval (in milliseconds) between Tachyon's checkpoint scheduling.
tachyon.master.lineage.checkpoint.class - The class name of the checkpoint strategy for lineage output files. The default strategy is to checkpoint the latest completed lineage, i.e. the lineage whose output files are completed.
tachyon.master.lineage.recompute.interval.ms - The interval (in milliseconds) between Tachyon's recompute executions. The executor scans all the lost files tracked by lineage and re-executes the corresponding jobs (by default every 10 minutes).
See Lineage API docs for more details.
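A hedged sketch of the client-side wiring from Spark (the master-side lineage settings above belong in tachyon-site.properties on the Tachyon master; the exact behaviour depends on your Tachyon and Spark versions):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Pass the Tachyon client lineage flag to driver and executor JVMs
    # as a system property (assumption: your Tachyon client reads it this way).
    .config("spark.driver.extraJavaOptions", "-Dtachyon.user.lineage.enabled=true")
    .config("spark.executor.extraJavaOptions", "-Dtachyon.user.lineage.enabled=true")
    .getOrCreate()
)

# Then write as in the question, e.g.
# df.write.parquet("tachyon://<master_host>:19998/path/to/output")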

Resources