We have multiple run-forever streaming jobs generating huge event logs. These in-progress logs won't be removed until they reach the max age config (spark.history.fs.cleaner.maxAge).
Based on the Spark source code, "Only completed applications older than the specified max age will be deleted." https://github.com/apache/spark/blob/a45647746d1efb90cb8bc142c2ef110a0db9bc9f/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala
So in-progress event logs will never be removed before completion, and they are eating up space. Does anyone have an idea how to prevent this?
We do have options such as a script that periodically removes old files, but that will be our last resort, and we cannot modify the source code, only the configuration.
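If upgrading is possible, one configuration-only mitigation is the rolling event log support added in Spark 3.0, which may postdate the version in use here, so treat this as a hedged sketch rather than a drop-in fix. With rolling enabled, a long-running job splits its event log into rolled files that the history server can compact, instead of growing a single .inprogress file until the application completes. Illustrative values, expressed as SparkConf settings for consistency with the other sketches on this page:

```scala
// A sketch of the Spark 3.0+ rolling event log configuration (values are illustrative).
import org.apache.spark.SparkConf

object RollingEventLogConf {
  // Settings for the long-running streaming application itself.
  def appConf(): SparkConf =
    new SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.rolling.enabled", "true")     // split the event log into rolled files
      .set("spark.eventLog.rolling.maxFileSize", "128m") // roll over once a file reaches this size

  // On the history server side, spark.history.fs.eventLog.rolling.maxFilesToRetain
  // bounds how many non-compacted rolled files are kept per application.

  def main(args: Array[String]): Unit =
    println(appConf().toDebugString)
}
```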
I read in the documentation that soft governance limits cause map/reduce scripts to yield and reschedule. My problem is that I cannot see where the docs explain what happens during the yield. Is getInputData called again to regather the same data set to be mapped, or is the initial data set persisted somewhere, with already mapped and reduced records excluded from processing?
With yielding, the getInputData stage is not called again. From the docs:
If a job monopolizes a processor for too long, the system can naturally finish the job after the current map or reduce function has completed. In this case, the system creates a new job to continue executing remaining key/value pairs. Based on its priority and submission timestamp, the new job either starts right after the original job has finished, or it starts later, to allow higher-priority jobs processing other scripts to execute. For more details, see SuiteScript 2.0 Map/Reduce Yielding.
This is different from server restarts or interruptions, however.
There may be an obvious answer to this, but I couldn't find any after a lot of googling.
In a typical program, I'd normally add log messages to time different parts of the code and find out where the bottleneck is. With Spark/PySpark, however, transformations are evaluated lazily, which means most of the code is executed in almost constant time (not a function of the dataset's size at least) until an action is called at the end.
So how would one go about timing individual transformations and perhaps making some parts of the code more efficient by doing things differently where necessary and possible?
You can use the Spark UI to see the execution plan of your jobs and the time spent in each phase. Then you can optimize your operations using those statistics. Here is a very good presentation about monitoring Spark apps using the Spark UI: https://youtu.be/mVP9sZ6K__Y (Spark Summit Europe 2016, by Jacek Laskowski).
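If you also want rough per-transformation numbers outside the UI, one common trick is to force each step to materialize with a cheap action and time it. A minimal Scala sketch (the paths and transformations are illustrative; the extra cache()/count() calls change the execution plan, so cross-check the numbers against the stage times shown in the Spark UI):

```scala
import org.apache.spark.sql.SparkSession

object StageTiming {
  // Time an arbitrary block of code and print the elapsed seconds.
  def timed[T](label: String)(block: => T): T = {
    val start = System.nanoTime()
    val result = block
    println(f"$label took ${(System.nanoTime() - start) / 1e9}%.2f s")
    result
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stage-timing").getOrCreate()
    import spark.implicits._

    val raw = spark.read.text("hdfs:///path/to/input") // path is illustrative

    // Materialize each step with count() so the time is attributable to that step.
    val parsed   = timed("parse")  { val d = raw.map(_.getString(0).toLowerCase).cache(); d.count(); d }
    val filtered = timed("filter") { val d = parsed.filter(_.nonEmpty).cache(); d.count(); d }
    timed("aggregate") { filtered.groupBy($"value").count().count() }

    spark.stop()
  }
}
```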
Any job troubleshooting should include the steps below.
Step 1: Gather data about the issue
Step 2: Check the environment
Step 3: Examine the log files
Step 4: Check cluster and instance health
Step 5: Review configuration settings
Step 6: Examine input data
From a Hadoop admin's perspective, basic troubleshooting for a long-running Spark job: go to the ResourceManager UI > Application ID.
a) Check for AM & non-AM preempted containers. This can happen if more memory than required is assigned to either the driver or the executors, which can then get preempted for a higher-priority job/YARN queue.
b) Click on the ApplicationMaster URL and review the environment variables.
c) Check the Jobs section and review the Event Timeline. Check whether the executors start immediately after the driver or take time.
d) If the driver process is taking time, see whether collect()/collectAsList() is running on the driver, as these methods tend to be slow because they retrieve all the elements of the RDD/DataFrame/Dataset (from all nodes) to the driver node (see the sketch after this list).
e) If there is no issue in the event timeline, go to the incomplete job > stages and check Shuffle Read Size/Records for any data-skew issue.
f) If all tasks are complete and the Spark job is still running, go to the Executors page > driver thread dump and search for the driver. Look for the operation the driver is working on. Below are the NameNode operation methods you may see there (if any):
getFileInfo()
getFileList()
rename()
merge()
getblockLocation()
commit()
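As a concrete illustration of point (d), here is a minimal Scala sketch (paths are illustrative) of bounding what reaches the driver instead of calling collect() on a large result:

```scala
import org.apache.spark.sql.SparkSession

object AvoidDriverCollect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("avoid-collect").getOrCreate()

    val result = spark.read.parquet("hdfs:///path/to/result") // illustrative path

    // Bad on large data: moves every row to the driver JVM.
    // val rows = result.collect()

    // Better: bound what reaches the driver...
    val preview = result.limit(100).collect()
    preview.foreach(println)

    // ...or keep the data distributed and write it out from the executors.
    result.write.mode("overwrite").parquet("hdfs:///path/to/output")

    spark.stop()
  }
}
```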
Can we run a mini Spark job embedded in our app? Any example? Reason: we want to process part of a file and return results more quickly than submitting a regular job. The file is only 500 lines (less than 1 MB in size). But we do not want to keep two code bases, just the one that is also used for the large files.
I want to process the file in the same JVM that my client code is running in, with a single executor kicked off from within that JVM via a flag in the config (so a few jobs will have this flag set and others won't; those that don't will run as usual on the cluster).
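One way this is commonly done with a single code base is to decide the Spark master from a flag: run embedded in the current JVM with a local master when the flag is set, and otherwise leave the master to the normal cluster submission path. A minimal sketch, assuming Spark 2.x's SparkSession and a hypothetical flag name app.spark.embedded:

```scala
import org.apache.spark.sql.SparkSession

object EmbeddedOrCluster {
  def buildSession(embedded: Boolean): SparkSession = {
    val builder = SparkSession.builder().appName("small-file-job")
    // In embedded mode the driver and execution backend run inside this JVM;
    // otherwise the master comes from spark-submit / YARN as usual.
    (if (embedded) builder.master("local[1]") else builder).getOrCreate()
  }

  def main(args: Array[String]): Unit = {
    // "app.spark.embedded" is a hypothetical flag name for illustration.
    val embedded = sys.props.get("app.spark.embedded").contains("true")
    val spark = buildSession(embedded)

    // Same processing code path for the 500-line file and the large files.
    val lines = spark.read.textFile("path/to/input")
    println(s"line count: ${lines.count()}")

    spark.stop()
  }
}
```

In embedded (local) mode everything runs inside the calling JVM, which avoids the overhead of submitting to the cluster for a tiny file; jobs without the flag keep the exact same processing code and run on the cluster as before.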
We have a Spark 1.6.1 application which takes input from two Kafka topics and writes the result to another Kafka topic. The application receives some large (approximately 1 MB) files on the first input topic and some simple conditions on the second input topic. If a condition is satisfied, the file is written to the output topic; otherwise it is held in state (we use mapWithState).
The logic works fine for a small number (a few hundred) of input files, but fails with org.apache.spark.rpc.RpcTimeoutException, and the recommendation is to increase spark.rpc.askTimeout. After increasing it from the default (120s) to 300s the job ran longer but crashed with the same error after 1 hour. After changing the value to 500s, the job ran fine for more than 2 hours.
Note: we are running the Spark job in local mode and Kafka is also running locally on the machine. Also, I sometimes see the warning "[2016-09-06 17:36:05,491] [WARN] - [org.apache.spark.storage.MemoryStore] - Not enough space to cache rdd_2123_0 in memory! (computed 2.6 GB so far)"
Now, 300s seemed like a large enough timeout considering everything runs locally. But any idea how to arrive at an ideal timeout value, instead of just using 500s or higher based on testing? I have seen cases that crashed using 800s and cases suggesting 60000s.
I was facing the same problem. I found a page saying that under heavy workloads it is wise to set spark.network.timeout (which controls all the timeouts, including the RPC one) to 800s. For the moment it solved my problem.
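For reference, a minimal sketch of what that looks like in the job's configuration (Scala, values illustrative). spark.rpc.askTimeout falls back to spark.network.timeout when it is not set explicitly, which is why raising the single umbrella setting is usually enough:

```scala
import org.apache.spark.SparkConf

object TimeoutConfig {
  // Build the SparkConf for the streaming job; the app name and values are illustrative.
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("kafka-file-router")
      .setMaster("local[*]")                  // the job in the question runs in local mode
      .set("spark.network.timeout", "800s")   // umbrella timeout; spark.rpc.askTimeout falls back to it
      // .set("spark.rpc.askTimeout", "800s") // set separately only if it should differ

  def main(args: Array[String]): Unit =
    println(buildConf().toDebugString)
}
```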
I currently run a service using beanstalkd and node.js.
When a job fails, I would like to retry it n times before giving up on it.
If the job succeeds, I want to run the same job 10 times.
So what is the best practice: store the error and success counts in MongoDB keyed by the jobId, or delete the job and put a new job with the error and success counts in the body?
I don't know if I'm being clear, so tell me. Thanks a lot.
There is a stats-job <id>\r\n command, which should also be available via the API library, that returns, among other things, how many times the specific job has been reserved, released, buried, and so on.
This allows for a number of retries of failed jobs by checking previous reservation/releases.
To run the same job multiple times, I would personally create either one additional job with a success count that is then incremented (into another new job), or all nine new jobs at once, with optional delays before they start.
You have a couple of ways to do this:
you can release the job, and obtain from stats the number of reserves
you can put a new job with a retry count, and keep track of history in the data payload
You should do the latter; then you don't need MongoDB as a second dependency.
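A minimal sketch of that second option, carrying the retry and success counters in the job payload itself. It is written in Scala for consistency with the other sketches on this page, even though the original stack is node.js, and putJob is a hypothetical placeholder for the real beanstalkd client call, not an actual API:

```scala
object RetryPayloadSketch {
  // The counters travel with the job body, so no external store (e.g. MongoDB) is needed.
  final case class JobBody(taskId: String, retries: Int, successes: Int)

  val maxRetries   = 5   // give up after this many consecutive failures
  val repeatTarget = 10  // re-run a successful job until it has succeeded this many times

  // Hypothetical placeholder for "put a new job on the tube with this body".
  def putJob(body: JobBody): Unit =
    println(s"put: $body")

  // Decide what to do after the reserved job has been processed.
  def afterRun(body: JobBody, succeeded: Boolean): Unit =
    if (succeeded) {
      if (body.successes + 1 < repeatTarget)
        putJob(body.copy(retries = 0, successes = body.successes + 1))
      // else: target reached, just delete the reserved job
    } else {
      if (body.retries + 1 <= maxRetries)
        putJob(body.copy(retries = body.retries + 1))
      // else: bury or delete the job and give up
    }

  def main(args: Array[String]): Unit = {
    afterRun(JobBody("job-42", retries = 0, successes = 0), succeeded = true)
    afterRun(JobBody("job-42", retries = 4, successes = 0), succeeded = false)
  }
}
```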