I'm using the latest Spark 3.3.0 but still got the exception: java.io.FileNotFoundException: File does not exist: /LOG_DIR/application_1657344020931_1400038_1.inprogress
which contradicts https://github.com/apache/spark/pull/29350 , how could it be?
In LOG_DIR there are 70k eventlogs.
One SHS is running on nodeA with Spark2.4.5.
In order to resolve the exception,I run a new SHS one node B with Spark 3.3.0 ,but still got the exception.
The basic problem is that some finished spark applications randomly cannot be seen from the SHS.I think the FileNotFoundException is the main cause and the PR is aimed to resolve it.
I re-checked the whole FsHistoryProvider.checkForLogs(...) and found that I misunderstood the PR.
My question: SHS always shows less (COMPLETED) apps than the number of event logs(exculde inprogress) in my LOG_DIR
This situation will always happens because when SHS found an untracked log eg. aaa._inprogress, at the moment checkForLogs is writing the aaa._inprogress to LevelDB, Spark renames it to aaa, thus FileNotFoundException happens and the loop is aborted.
So the rest of the logs which may be 10% of all apps in the loop are missed. That's why my SHS less so many than the LOG_DIR.
THE PR above skip such problematic aaa event logs in this loop, but next time, since aaa is not in LevelDB, it will be tracked.
SHS still show less than the num of eventlogs, but just 40~70 apps in my cluster and these apps will show soon
Related
There may be an obvious answer to this, but I couldn't find any after a lot of googling.
In a typical program, I'd normally add log messages to time different parts of the code and find out where the bottleneck is. With Spark/PySpark, however, transformations are evaluated lazily, which means most of the code is executed in almost constant time (not a function of the dataset's size at least) until an action is called at the end.
So how would one go about timing individual transformations and perhaps making some parts of the code more efficient by doing things differently where necessary and possible?
You can use Spark UI to see the execution plan of your jobs and time of each phase of them. Then you can optimize your operations using that statistics. Here is a very good presentation about monitoring Spark Apps using Spark UI https://youtu.be/mVP9sZ6K__Y (Spark Sumiit Europe 2016, by Jacek Laskowski)
Any job troubleshooting should have the below steps.
Step 1: Gather data about the issue
Step 2: Check the environment
Step 3: Examine the log files
Step 4: Check cluster and instance health
Step 5: Review configuration settings
Step 6: Examine input data
From the Hadoop Admin perspective, Spark long-running job basic troubleshooting. Go to RM > Application ID.
a) Check for AM & Non-AM Preempted. This can happen if more that required memory is assigned either to driver or executors which can get preempted for a high priority job/YARN queue.
b) Click on AppMaster url. Review Environment variables.
c) Check Jobs section, review Event timeline. Check if executors are getting started immediately after driver or taking time.
d) If Driver process is taking time, see if collect()/ collectAsList() is running on driver as these method tends to take time as they retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes) to the driver node.
e) If no issue in event timeline, go to the incomplete task > stages and check Shuffle Read Size/Records for any Data Skewness issue.
f) If all tasks are complete and still Spark job is running, then go to Executor page > Driver process thread dump > Search for driver. And lookout for operation the driver is working on. Below will be NameNode operation method we can see there (if any).
*getFileInfo()
getFileList()
rename()
merge()
getblockLocation()
commit()*
We have a spark 1.6.1 application, which takes input from two kafka topics and writes the result to another kafka topic. The application receives some large (approximately 1MB) files in the first input topic and some simple conditions from the second input topic. If the condition is satisfied, the file is written to output topic else held in state (we use mapWithState).
The logic works fine for less (few hundred) number of input files, but fails with org.apache.spark.rpc.RpcTimeoutException and recommendation is to increase spark.rpc.askTimeout. After increasing from default (120s) to 300s the ran fine longer but crashed with the same error after 1 hour. After changing the value to 500s, the job ran fine for more than 2 hours.
Note: We are running the spark job in local mode and kafka is also running locally in the machine. Also, some time I see warning "[2016-09-06 17:36:05,491] [WARN] - [org.apache.spark.storage.MemoryStore] - Not enough space to cache rdd_2123_0 in memory! (computed 2.6 GB so far)"
Now, 300s seemed large enough a timeout considering all local configuration. But any idea, how to come up to an ideal timeout value instead of just using 500s or higher based on testing, as I see crashed cases using 800s and cases suggesting to use 60000s?
I was facing the same problem, I found this page saying that under heavy workloads it is wise to set spark.network.timeout(which controls all the timeouts, also the RPC one) to 800. At the moment it solved my problem.
we are using Cassandra 1.2.9 + BAM 2.5 for API analysis.
We have scheduled a job to do cassandra data purge. This data purge job is divived into three steps.
The 1st step is to query the original column family and then insert them into the temporary columnFamily_purge.
The 2nd step is to delete from the orinal column family by adding tombstone,and insert the data from columnFamily_purge into the original column family.
The 3rd step is to drop the temporary columnFamily_purge
The 1st works well, but the 2nd step frequently crashes the cassandra servers during Hadoop map tasks,which makes Cassandra unavailable.The exception stacktrack is as follows:
2016-08-23 10:27:43,718 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName hadoop for UID 47338 from the native implementation
2016-08-23 10:27:43,720 WARN org.apache.hadoop.mapred.Child: Error running child
me.prettyprint.hector.api.exceptions.HectorException: All host pools marked down. Retry burden pushed out to client.
at me.prettyprint.cassandra.connection.HConnectionManager.getClientFromLBPolicy(HConnectionManager.java:390)
at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:244)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:113)
at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:243)
at me.prettyprint.cassandra.service.template.AbstractColumnFamilyTemplate.deleteRow(AbstractColumnFamilyTemplate.java:173)
at org.wso2.carbon.bam.cassandra.data.archive.mapred.CassandraMapReduceRowDeletion$RowKeyMapper.map(CassandraMapReduceRowDeletion.java:246)
at org.wso2.carbon.bam.cassandra.data.archive.mapred.CassandraMapReduceRowDeletion$RowKeyMapper.map(CassandraMapReduceRowDeletion.java:139)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Could someone help on this what may lead to this problem? Thanks!
This can happen due to 3 reasons.
1) Cassandra servers are down. I don't thing this is the case in your setup.
2) Network issues
3) The load is higher than what cluster can handle.
How do you delete data? Using a hive script?
After I increase the number of open files and max thread number,the problem is gone.
We have multiple run-forever streaming jobs generating huge eventLogs. These in-progress logs won't be removed until reach the the max age config (spark.history.fs.cleaner.maxAge).
Based on the Spark source code, "Only completed applications older than the specified max age will be deleted." https://github.com/apache/spark/blob/a45647746d1efb90cb8bc142c2ef110a0db9bc9f/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
So, in-progress eventLog will never be removed before completion and they are eating up space. Anyone have idea how to prevent it?
We have option like script periodically removes old files, but it will be our last resort, and we cannot modify the source code, but configuration.
I'm learning Spark, and quite often I have some issue that causes tasks and stages to fail. With my default configuration, there are rounds of retries and a bunch of ERROR messages to that effect.
While I totally appreciate the idea of retrying tasks when I finally get to production, I'd love to know how to make my application fail at the first sign of trouble so that I can avoid all the extra noise in the logs and within the application history itself. For example, if I run it out of memory, I'd love to just see the OOM exception near the end of my log and have the whole app fail.
What's the best way to setup configs for this kind of workflow?
You can set spark.task.maxFailures to 1.
spark.task.maxFailures is the number of individual task failures before giving up on the job, and its default value is 4.