Databricks Checksum error while writing to a file - apache-spark

I am running a job on 9 nodes.
All of them write some information to files using simple writes like the one below:
dfLogging.coalesce(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)
However I am receiving this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling
o106.save. : java.util.concurrent.ExecutionException:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task
1.0 in stage 14.0 (TID 259, localhost, executor driver): org.apache.hadoop.fs.ChecksumException: Checksum error:
file:/dbfs/delta/Logging/_delta_log/00000000000000000063.json at 0
exp: 1179219224 got: -1020415797
It looks to me as if concurrency is causing Spark to fail and generate checksum errors.
Is there any known scenario that may be causing it?

A couple of things are going on here, and they explain why coalesce may not work.
coalesce essentially combines the partitions that already sit on each worker. For example, if you have three workers, you can call coalesce(3), which would consolidate the partitions on each worker.
repartition shuffles the data to increase or decrease the total number of partitions. In your case, if you have more than one worker and need a single output, you would have to use repartition(1), since you want the data on a single partition before writing it out.
Why would coalesce not work?
Spark limits shuffling during coalesce. You cannot perform a full shuffle (across different workers) with coalesce, whereas repartition does perform a full shuffle, although it is an expensive operation.
Here is the code that would work:
dfLogging.repartition(1).write.format('delta').mode('append').save('/dbfs/' + loggingLocation)

Related

Spark breaks when you need to make a very large shuffle

I'm working with 1 terabyte of data, and at one point I need to join two smaller dataframes. I don't know their exact size, but it is more than 200 GB, and I get the error below.
The failure occurs in the middle of the operation, after 2 hours.
It looks like a memory problem to me, but that doesn't make sense, because in the Spark Ganglia UI the RAM usage doesn't reach the limit, as shown in the screenshot below.
Does anyone have any idea how I can solve this without decreasing the amount of data analyzed?
My cluster has:
1 x master node n1-highmem-32
4 x slave node n1-highmem-32
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 482.1 failed 4 times, most recent failure: Lost task 3.3 in stage 482.1 (TID 119785, 10.0.101.141, executor 1): java.io.FileNotFoundException: /tmp/spark-83927f3e-4511-1b/3d/shuffle_248_72_0.data.f3838fbc-3d38-4889-b1e9-298f743800d0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
Caused by: java.io.FileNotFoundException: /tmp/spark-83927f3e-4511-1b/3d/shuffle_248_72_0.data.f3838fbc-3d38-4889-b1e9-298f743800d0 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
These types of errors typically occur when there are deeper problems with some tasks, such as significant data skew. Since you don't provide enough details (please be sure to read How To Ask and How to create a Minimal, Complete, and Verifiable example) or job statistics, the only approach I can think of is to significantly increase the number of shuffle partitions:
```
sqlContext.setConf("spark.sql.shuffle.partitions", "2048")
```
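One rough way to choose a value instead of guessing is to divide the amount of shuffled data by a target size per partition. The sketch below is plain Python, not Spark code; the helper name `estimate_shuffle_partitions` and the 128 MB target are assumptions made for illustration, not Spark defaults or anything stated in the question.

```python
import math

def estimate_shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * 1024**2):
    """Rough partition count so each shuffle partition holds about the target size.

    Hypothetical helper: the 128 MB default is an assumed rule of thumb.
    """
    if shuffle_bytes <= 0:
        return 1
    return max(1, math.ceil(shuffle_bytes / target_partition_bytes))

# About 200 GB of shuffled data, the lower bound mentioned in the question.
partitions = estimate_shuffle_partitions(200 * 1024**3)
print(partitions)  # 1600
```

With heavy skew a single key can still dominate one partition, so a number obtained this way is only a starting point.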

Kryo serialization failed: Buffer overflow

We read data stored in S3 in hourly folders using Spark in Scala. For example:
sparkSession
  .createDataset(sc
    .wholeTextFiles("s3://<Bucket>/<key>/<yyyy>/<MM>/<dd>/<hh>/*")
    .values
    .flatMap(x => {
      x.replace("\n", "")
        .replace("}{", "}}{{")
        .split("\\}\\{")
    }))
The slicing and dicing above (replace and split) converts the pretty-printed JSON data into JSON Lines (one JSON record per line).
Now I am getting this error while running on EMR:
Job aborted due to stage failure: Task 1 in stage 11.0 failed 4 times, most recent failure: Lost task 1.3 in stage 11.0 (TID 43, ip-10-0-2-22.eu-west-1.compute.internal, executor 1): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 1148334. To avoid this, increase spark.kryoserializer.buffer.max value.
I have tried increasing the Kryo serializer buffer with --conf spark.kryoserializer.buffer.max=2047m, but I still get this error when reading some hour locations, such as hours 09 and 10, while other hours read fine.
How can I get rid of this error, and do I need to change anything else in the Spark configuration, such as the number of partitions? Thanks.
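For readers following the string manipulation in the question, the same replace and split chain can be illustrated in plain Python (no Spark involved). The function name `split_concatenated_json` and the sample input are made up for this sketch, and it assumes the sequence `}{` never occurs inside a string value.

```python
def split_concatenated_json(text):
    """Split back-to-back JSON objects into one string per object.

    Mirrors the replace/split chain from the question. Assumes "}{"
    never appears inside a string value.
    """
    flattened = text.replace("\n", "")
    # Double the braces at each object boundary so that splitting on "}{"
    # leaves a closing brace on the left part and an opening brace on the right.
    marked = flattened.replace("}{", "}}{{")
    return marked.split("}{")

raw = '{\n"a": 1\n}{\n"b": 2\n}'  # two pretty-printed objects, back to back
records = split_concatenated_json(raw)
print(records)  # ['{"a": 1}', '{"b": 2}']
```

Note that wholeTextFiles hands each file to this logic as one string, so the largest single file in an hour folder determines how big one record is before the split.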

Spark Structured streaming - java.lang.OutOfMemoryError: Java heap space

I am getting the below exception when processing input streams using Spark structured streaming.
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 22 in stage 5.0 failed 1 times, most recent failure: Lost task
22.0 in stage 5.0 (TID 403, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
I have handled watermark as given below,
.withWatermark("timestamp", "5 seconds")
.groupBy(window($"timestamp", "1 second"), $"column")
What could be the issue? I have tried changing the trigger from the default to a fixed interval, but I am still facing the problem.
I don't believe this issue is related to watermarks or triggers. OutOfMemory errors occur for two reasons:
Memory leaks. This programming error leads your application to constantly consume more memory. Every time the leaking functionality of the application is used, it leaves some objects behind in the Java heap space. Over time, the leaked objects consume all of the available Java heap space and trigger the error.
Too much data for the resources designated to it. Your cluster has a designated threshold and can only hold a certain amount of data. When the volume of data exceeds that threshold, a job that functioned normally before the spike ceases to operate and triggers the java.lang.OutOfMemoryError: Java heap space error.
Your error also says task 22.0 in stage 5.0, which means that it completed stages 1 - 4 successfully. To me, that signifies that there was too much data for the resources designated to it, since it did not die over multiple runs as it would with a memory leak. Try limiting the amount of data being read in with something like spark.readStream.option("maxFilesPerTrigger", "6"), or increase the memory assigned to that cluster.
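To illustrate what capping the input per trigger achieves, here is a plain-Python simulation; it is not Spark code. `micro_batches` is a hypothetical helper that groups pending files into batches of at most `max_files_per_trigger`, and the file names are invented.

```python
def micro_batches(pending_files, max_files_per_trigger):
    """Yield successive batches with at most max_files_per_trigger files each."""
    if max_files_per_trigger < 1:
        raise ValueError("max_files_per_trigger must be at least 1")
    for start in range(0, len(pending_files), max_files_per_trigger):
        yield pending_files[start:start + max_files_per_trigger]

files = [f"part-{i:03d}.json" for i in range(14)]  # made-up file names
batch_sizes = [len(batch) for batch in micro_batches(files, 6)]
print(batch_sizes)  # [6, 6, 2]
```

Each batch touches a bounded number of files, so a sudden backlog is spread over several triggers instead of landing in one.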

How does hive on spark determine reducer number?

I enabled Hive on Spark according to Cloudera documentation 1 and 2. I now find that the number of reducers behaves unexpectedly. I wish someone could provide detailed documentation or an explanation of this.
As far as I know, Hive on MR calculates the number of reducers from the data volume and hive.exec.reducers.bytes.per.reducer (the number of bytes each reducer processes), so job parallelism is adjusted automatically. Hive on Spark seems to treat this parameter differently. Although setting it to a very low number (<1K) does increase the number of reducers, no common rule can be applied across different jobs.
Below is a segment from the Cloudera tuning documentation on parallelism.
Adjust hive.exec.reducers.bytes.per.reducer to control how much data each reducer processes, and Hive determines an optimal number of partitions, based on the available executors, executor memory settings, the value you set for the property, and other factors. Experiments show that Spark is less sensitive than MapReduce to the value you specify for hive.exec.reducers.bytes.per.reducer, as long as enough tasks are generated to keep all available executors busy
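The Hive on MR rule described above amounts to dividing the input size by hive.exec.reducers.bytes.per.reducer and capping the result. A minimal sketch in plain Python follows; the function name, the cap parameter (assumed here to play the role of hive.exec.reducers.max), and the example values are illustrative, and the sketch does not model what Hive on Spark does.

```python
import math

def estimate_mr_reducers(input_bytes, bytes_per_reducer, max_reducers):
    """Estimate reducers as input size divided by bytes per reducer, capped."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))

MB = 1024**2
GB = 1024**3
# Example values only: 10 GB of input and a cap of 1009 reducers.
print(estimate_mr_reducers(10 * GB, 256 * MB, 1009))  # 40
print(estimate_mr_reducers(10 * GB, 1024, 1009))      # 1009
```

Under this rule a very small bytes-per-reducer value simply hits the cap, which is the predictable behaviour the question contrasts with Hive on Spark.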
Also, I understand that an RDD in Spark spills data to disk when memory is not sufficient. If so, the following error message from Hive on Spark jobs really confuses me.
Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 146, fuxi-luoge-105, executor 34): ExecutorLostFailure (executor 34 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 6.2 GB of 6.0 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
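The limit in this message applies to the whole YARN container, which covers the executor heap plus the overhead named at the end of the message. The arithmetic can be sketched in plain Python; the helper name is hypothetical, and the fallback of 10% of executor memory with a 384 MB minimum is an assumption based on the commonly documented Spark on YARN default, so check it against your version.

```python
def yarn_container_request_mb(executor_memory_mb, memory_overhead_mb=None):
    """Memory requested from YARN per executor: heap plus off-heap overhead."""
    if memory_overhead_mb is None:
        # Assumed default: 10% of executor memory, but at least 384 MB.
        memory_overhead_mb = max(384, int(executor_memory_mb * 0.10))
    return executor_memory_mb + memory_overhead_mb

# Example values only; the question does not state its executor memory.
print(yarn_container_request_mb(4096))        # 4505
print(yarn_container_request_mb(4096, 1024))  # 5120
```

Memory used outside the JVM heap counts against the container but is not something Spark can spill to disk, which is one way a job can be killed by YARN without a Java heap error.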

Spark code to protect from FileNotFoundExceptions?

Is there a way to run my spark program and be shielded from files
underneath changing?
The code starts by reading a parquet file (no errors during the read):
val mappings = spark.read.parquet(S3_BUCKET_PATH + "/table/mappings/")
It then does transformations with the data e.g.,
val newTable = mappings.join(anotherTable, 'id)
These transformations take hours (which is another problem).
Sometimes the job finishes, other times, it dies with the following similar message:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 6 in stage 1014.0 failed 4 times, most recent failure: Lost task
6.3 in stage 1014.0 (TID 106820, 10.127.251.252, executor 5): java.io.FileNotFoundException: No such file or directory:
s3a://bucket1/table/mappings/part-00007-21eac9c5-yyzz-4295-a6ef-5f3bb13bed64.snappy.parquet
We believe another job is changing the files underneath us, but haven't been able to find the culprit.
This is a complicated problem to solve. If the underlying data changes while you are operating on the same dataframe, the Spark job will fail. The reason is that when the dataframe was created, the underlying RDD recorded the location of the data and the DAG associated with it. If some other job then changes the underlying data, the RDD has no option but to fail.
One possibility is to enable retries, speculation, etc., but the problem nevertheless remains. Generally, if you have a Parquet table and want to read and write at the same time, partition the table by date or time, so that writes happen in one partition while reads happen in another.
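The partitioning idea can be sketched in plain Python: a writer targets the partition for the current date, while a reader only lists earlier partitions that are no longer being written. The `dt=` directory naming, the fixed date, and both helper names are assumptions for illustration; the base path is the one from the error message in the question.

```python
from datetime import date, timedelta

def partition_path(base_path, day):
    """Directory for one date partition, e.g. <base>/dt=2024-01-31."""
    return f"{base_path.rstrip('/')}/dt={day.isoformat()}"

def closed_partitions(base_path, today, days_back):
    """Partitions for the days before `today`, which the writer no longer touches."""
    return [partition_path(base_path, today - timedelta(days=n))
            for n in range(days_back, 0, -1)]

today = date(2024, 1, 3)  # fixed date so the example is reproducible
base = "s3a://bucket1/table/mappings"
print(partition_path(base, today))
# s3a://bucket1/table/mappings/dt=2024-01-03
print(closed_partitions(base, today, 2))
# ['s3a://bucket1/table/mappings/dt=2024-01-01', 's3a://bucket1/table/mappings/dt=2024-01-02']
```

This only helps if the writing job never rewrites older partitions.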
Now to the problem of the join taking a long time. If you read the data from S3, join it, and write back to S3, performance will be slower, because Hadoop first has to fetch the data from S3 and only then perform the operation (the code does not go to the data). Although the network call is fast, I ran some experiments comparing S3 with EMR FS and found a 50% slowdown with S3.
One alternative is to copy the data from S3 to HDFS and then run the join. That will shield you from the data being overwritten, and performance will be better.
One last thing: if you are using Spark 2.2, S3 writes are painfully slow due to the deprecation of DirectOutputCommitter. That could be another reason for the slowdown.

Resources