Elasticsearch write using Spark - apache-spark

I'm creating a document collection in Spark as an RDD and using the Spark read/write library from Elasticsearch. The Cluster that creates the collection is large so when it writes to ES I get the errors below indicating ES is overloaded, which does not surprise me. This does not seem to fail the job. The tasks may being retried and eventually succeed. In the Spark GUI the job is reported as having finishing successfully.
is there a way to somehow throttle the ES writing lib to avoid the retries (I can't change the cluster size)?
Do these errors mean that some data was not written to the index?
Here is one of many reported task failure errors, but again no job failure is reported:
2017-03-20 10:48:27,745 WARN org.apache.spark.scheduler.TaskSetManager [task-result-getter-2] - Lost task 568.1 in stage 81.0 (TID 18982, ip-172-16-2-76.ec2.internal): org.apache.spark.util.TaskCompletionListenerException: Could not write all entries [41/87360] (maybe ES was overloaded?). Bailing out...
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:112)
at org.apache.spark.scheduler.Task.run(Task.scala:102)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The lib I'm using is
org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.2"

Can you follow this link - https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
In the Spark conf property or in your elastic search property you need to increase the max number of records which can be dumped in a single post, and that should solve your problem.

Related

GCS does not write all records in Spark3

I have seen several threads related to this but I found that mostly the issue is with AWS s3 and not Azure or GCS. I have a situation where I am running dataproc cluster and writing results in parquet table backed by GCS bucket.
Now, the behavior of GCS so far has been inconsistent. It sometimes writes all records and sometimes misses few records (not files, it's records). Like if I am writing 43000 records, it will write about 42745 records something. The reason I mentioned it as records because it produces 100 files of equal size when correctly written and it still has all 100 files and if it was missing single file, it should have missed about 4000 records. The data is equally distributed. Also, when I rerun the job, it sometimes writes all records, or sometimes writes different number of records, i.e. 42985 for example.
Everytime this happens, I have noticed a stacktrace in spark job for that specific hour like below. Also, this doesn't cause the job to fail. It just gives this stacktrace but the job status turns out as success after spark-sql query.
22/11/22 00:59:13 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 55.0 in stage 2.0 (TID 255) (cluster-sample-w-3.c.network.internal executor 3): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:296)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:505)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:508)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.FileNotFoundException: listStatus(hadoopPath: gs://<some_bucket>/hive/warehouse/<some_db>.db/<some_table>/data/_temporary/0/_temporary/attempt_202211220058563982258671276457664_0002_m_000055_255/dt=20221111/hr=01): 'gs://<some_bucket>/hive/warehouse/<some_db>.db/<some_table>/data/_temporary/0/_temporary/attempt_202211220058563982258671276457664_0002_m_000055_255/dt=20221111/hr=01' does not exist.
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.listStatus(GoogleHadoopFileSystemBase.java:865)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergeDirectory(FileOutputCommitter.java:529)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:501)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergeDirectory(FileOutputCommitter.java:538)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:501)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergeDirectory(FileOutputCommitter.java:538)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:501)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergeDirectory(FileOutputCommitter.java:538)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:501)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:653)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:616)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:77)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:269)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:79)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:280)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:286)
... 9 more
Caused by: java.io.FileNotFoundException: Item not found: gs://<somebucket>/hive/warehouse/<some_db>.db/<some_table>/data/_temporary/0/_temporary/attempt_202211220058563982258671276457664_0002_m_000055_255/dt=20221111/hr=01
at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.listFileInfo(GoogleCloudStorageFileSystem.java:1039)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.listStatus(GoogleHadoopFileSystemBase.java:856)
... 26 more
This is happening across multiple tables and randomly. So, it does brings question if GCS writes are consistent in Spark? I did read this part where it says Cloud storage is not drop in replacement for HDFS, but then what's the alternative to solve such random behavior.
Environment:
GCS bucket:
Spark 3.1.3
Scala: 2.12.14
Dataproc Image: 2.0-rocky8
GCS Hadoop connector: gcs-connector-hadoop3-2.2.8.jar
Hadoop 3.2.3
Source code repository https://bigdataoss-internal.googlesource.com/third_party/apache/hadoop -r c87f29d51bb88311d1adba1bc5bd7dfdfa345ebc
Compiled by bigtop on 2022-11-01T20:07Z
Compiled with protoc 2.5.0

Databricks notebooks crashes on memory job

I am running few operations to aggregate a big quantity of data (about 600gb) on azure databricks. I noticed recently that the notebook crashes and the databricks returns the error below. The same code worked before with smaller 6 nodes cluster. After upgrading it to 12 nodes, I started getting this and I am doubting that it is a config problem.
Any help please, I use the default spark configuration with partitions number=200 and I have 88 executors on my nodes.
Thanks
Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster.
java.lang.RuntimeException: abort: DriverClient destroyed
at com.databricks.backend.daemon.driver.DriverClient.$anonfun$poll$3(DriverClient.scala:381)
at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at com.databricks.threading.NamedExecutor$$anon$2.$anonfun$run$1(NamedExecutor.scala:335)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:238)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:233)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:230)
at com.databricks.threading.NamedExecutor.withAttributionContext(NamedExecutor.scala:265)
at com.databricks.threading.NamedExecutor$$anon$2.run(NamedExecutor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I'm not sure about the cost implications, but how about enabling auto scaling option on cluster and bumping up Max Workers. Also you can try changing the Worker Type to have better resources
Just for other people facing similar issue.
In my situation, sometimes the same error happened when there's multiple Spark actions in one cell of a Databricks notebook.
Surprisingly, spliting the cell before the code where the error occurred or simply inserting time.sleep(5) there worked for me. However I'm not sure why it worked...
For example:
df1.count() # some Spark action
# split the cell or insert `time.sleep(5)` here
pipeline.fit(df1) # another Spark action where the error happened

Spark staging directory race condition when writing to hive partition?

I am seeing intermittent exceptions when attempting to write a Dataset to a partition in a hive table.
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: /user/hive/warehouse/devl_fr9.db/fr9_ftdelivery_cpy_2_4d8eebd3_9691_47ce_8acc_b2a5123dabf6/.spark-staging-d996755c-eb81-4362-a393-31e8387104f0/date_id=20180604/part-00000-d996755c-eb81-4362-a393-31e8387104f0.c000.snappy.parquet for client 10.56.219.20 already exists
If I check HDFS the relevant path does not exist. I can only assume this is some race condition regarding temp staging files. I am using Spark 2.3
A possible reason for this issue is that, during a job's execution, a task started writing data to that file and failed.
When a task fails, the data that it had already written is not deleted/purged by Spark (confirmed at least in 2.3 and 2.4). Therefore, when a different executor attempts to re-execute the failed task it will attempt to write to a file with the same name, and you'll get a FileAlreadyExistsException.
In your case, the file that already exists is called part-00000-d996755c-eb81-4362-a393-31e8387104f0.c000, so it's likely that you have a log message in stderr indicating that task 00000 was lost due to failure, something like
WARN TaskSetManager: Lost task **00000** in stage...
If you fix the reason for this failure - probably an OutOfMemoryError, if the issue is intermitent - the FileAlreadyExistsException will likely be solved because the task will not fail and leave temporary files behind.

Spark code to protect from FileNotFoundExceptions?

Is there a way to run my spark program and be shielded from files
underneath changing?
The code starts by reading a parquet file (no errors during the read):
val mappings = spark.read.parquet(S3_BUCKET_PATH + "/table/mappings/")
It then does transformations with the data e.g.,
val newTable = mappings.join(anotherTable, 'id)
These transformations take hours (which is another problem).
Sometimes the job finishes, other times, it dies with the following similar message:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 6 in stage 1014.0 failed 4 times, most recent failure: Lost task
6.3 in stage 1014.0 (TID 106820, 10.127.251.252, executor 5): java.io.FileNotFoundException: No such file or directory:
s3a://bucket1/table/mappings/part-00007-21eac9c5-yyzz-4295-a6ef-5f3bb13bed64.snappy.parquet
We believe another job is changing the files underneath us, but haven't been able to find the culprit.
This is a very complicated problem to solve here. If the underlying data changes while you are operating on the same dataframe the spark job will fail. The reason is when the dataframe was created the underlying RDD knew the location of the data and the DAG associated with it. Now if the underlying data suddenly changed by some job , RDD has no option but fail it.
One possibility of enable retry ,speculation etc but nevertheless the problem exists. Generally if you have a table in parquet and you want to read write at the same time, partition the table by date or time and then write will happen in the different partition while reading will happen in different partition.
Now with the problem of join taking long time. If you are reading the data from s3 then join and write back to s3 again the performance will be slower. Because now the hadoop needs to fetch the data from s3 first then perform the operation ( code not going to data ). Although the network call is fast, I ran some experiment with s3 vs EMR FS and found 50% slowdown with s3.
One alternative is to copy the data from s3 to HDFS and then run the join. That will shield you from the data overwriting and the performance will be faster.
One last thing if you are using spark 2.2 s3 write is painfully slow due to deprecation of DirectOutputCommiter. So that could be another reason for slowdown

Apache Spark: Master removed our application: Failed when using saveAsTextFile on large RDD

I'm loading a file of 20 GB in Spark Standalone mode on a machine with 4GB RAM and 2 cores, do some processing and then try to save the result (for testing purposes) to a text file using saveAsTextFile.
If I manually extract only a few thousand lines from the original input file and run the code on that, it works like a charm, resulting in the expected part-xxxxx files.
However, if I provide the whole 20GB file as input, it will start off fine, then hang somewhere along the process and when let run overnight it will have failed in the morning with the following message:
Py4JJavaError: An error occurred while calling o219.saveAsTextFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Does anyone have an idea why this might be?
There might be a few issues here, but the most commons would be :
Too many file opens killed the process
The serializer you're using (by default the Java serializer) might create a GC Overhead - try switching to Kryo Serializer (c.f. Spark's documentation : Tuning page)
finally you're running out of disk.
Now without any idea of what your computation is or the logs from the actual server, it's difficult to tell what happened.

Resources