Spark event logs for a long-running process - apache-spark

I have a long-running Spark process and can see the event logs it generates taking up a huge amount of space. When I investigated the logs, I found that apart from the regular job, stage, and task events there are many entries like
{"Event":"SparkListenerUnpersistRDD","RDD ID":415}.
I have made sure there are no manual unpersist calls, but I can still see many of these entries being generated.
Is there a way to disable these UnpersistRDD event logs?
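For what it's worth, SparkListenerUnpersistRDD entries are normally written by Spark's ContextCleaner, which automatically unpersists RDDs that have gone out of scope, so they show up even when you never call unpersist() yourself. As far as I know there is no switch that filters individual event types out of the event log, but here is a rough sketch of settings that can keep the log size in check (assuming Spark 3.0+; the app name is made up):

```python
# Sketch only, not a definitive fix.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("long-running-job")                      # hypothetical app name
    .config("spark.eventLog.enabled", "true")
    # Roll the event log so no single file grows without bound (Spark 3.0+).
    .config("spark.eventLog.rolling.enabled", "true")
    .config("spark.eventLog.rolling.maxFileSize", "128m")
    # Turning off reference tracking stops the ContextCleaner (and hence the
    # UnpersistRDD events), but out-of-scope RDDs and broadcasts will no longer
    # be cleaned up automatically, so use with care.
    # .config("spark.cleaner.referenceTracking", "false")
    .getOrCreate()
)
```

Rolling only splits the log into smaller files; if you read the logs through the history server, spark.history.fs.eventLog.rolling.maxFilesToRetain controls how many rolled files are kept before old ones are compacted away.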

Related

Celery task suddenly missing

I have a long-running Celery task which I can see in Flower at the beginning of execution, but after some hours it's as if it doesn't exist! All I see on the task details page is Unknown task '894a8b45-5963-40da-a104-7ffff98bc267', and I cannot find that key in Redis, but I can tell the task is still running by looking at the logs. The task does not fail, it just disappears!
Flower keeps task information in memory. It is not related to your Celery backend (Redis, for example).
By default it keeps the last 10,000 tasks, so if your system runs many tasks it makes sense that your task becomes older and is purged to make room for newer tasks.
Assuming that's the case, you'll see the Unknown task message.
You can tune the number of tasks to keep in memory with the --max-tasks option.
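For example, here is a sketch of a Flower configuration file; Flower can load its settings from a flowerconfig.py passed via --conf, and the value below is purely illustrative:

```python
# flowerconfig.py -- illustrative only; start Flower with e.g.
#   celery -A proj flower --conf=flowerconfig.py
# Keep the most recent 100,000 finished tasks in memory instead of the
# default 10,000 before Flower starts purging them.
max_tasks = 100000
```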

Azure worker role instance got stuck

I have a continuously running Worker Role that executes multiple jobs. The jobs are there to process queue messages. Normally if there is an exception or any problem, the job will fail, the queued message will go back into the queue, and the job will try to reprocess it.
But since last month I have been facing a weird issue: no messages have been processed in the past day or so. I investigated on the Azure Portal and saw that the worker role instance still had a "running" status. For some reason the job did not time out or quit, but all the messages were sitting in the queue, unprocessed.
There were also no logs or exceptions/errors thrown (I have a decent amount of logging and exception handling in the method).
I restarted the worker role via the Azure Portal, and once that happened, all of the backed up queue messages began processing immediately.
Can anyone help with solutions or suggestions for handling this case?
RDP to the VM and troubleshoot it just like you would troubleshoot it on-prem. What do performance counters show you? Is your process (or any other) consuming CPU? Anything in the event logs? Take a hang dump of WaWorkerHost.exe and check the callstacks to see what your code is doing or if it is stuck in something like a deadlock or infinite loop.
You can also check the guest agent and host bootstrapper logs (see https://blogs.msdn.microsoft.com/kwill/2013/08/09/windows-azure-paas-compute-diagnostics-data/), but since you said the portal was reporting that the instance was in the Ready state, I don't think you will find anything there. It sounds like 'Azure' (the role host processes) is working fine and it is something within WaWorkerHost.exe (your code) that is the problem.

Add Tasks to a running Azure batch job and manually control termination

We have an Azure Batch job that uses some quite large files, which we are uploading to Azure Blob storage asynchronously so that we don't have to wait for all files to upload before starting our batch job, which is made up of a collection of Tasks that will process each file and generate output. All good so far - this is working fine.
I'd like to be able to create an Azure Task and add it to an existing, running Azure Job, increasing the length of the Task list, but I can't find how to do this. It seems that Azure expects you to define ALL Tasks for a Job before the Job starts, and then it runs until all Tasks are complete and terminates the Job (which makes sense in some scenarios - but not mine).
I would like to suppress this Job completion behavior and be able to queue up additional Azure Tasks for the same job. I could then monitor the Azure Job status (via the Tasks) and determine myself if the Job is complete.
Our issue is that uploads of multi-MB files take time and we want Task processing to start as soon as the first file is available. If we have to wait until all files are available, then our processing start is delayed, which is not what we need.
We 'could' create a job per task and manage it in our application, but that is a little 'messy' and I would like to use the encapsulating Azure Job entity and supporting functionality if I possibly can.
Has anyone done this and can offer some guidance? Many thanks.
You can add new tasks to an existing Azure Batch job while it is in the active state. (There is no "running" state for an Azure Batch job; you can find a list of Azure Batch job states here.)
By default, Azure Batch jobs do not automatically terminate when all of their tasks complete. You can view this related question regarding that behavior.
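As a rough illustration with the Azure Batch Python SDK (azure-batch) - the account details, IDs, and command line below are made up, and keyword spellings can differ between SDK versions - you can keep adding tasks while the job is active and only flip its completion behavior once the last task has been queued:

```python
# Sketch only: add tasks to an active Azure Batch job as uploads finish, then
# tell the job to terminate itself once everything queued has completed.
from azure.batch import BatchServiceClient, batch_auth
import azure.batch.models as batchmodels

credentials = batch_auth.SharedKeyCredentials("mybatchaccount", "<account-key>")
# Older azure-batch versions name this keyword base_url instead of batch_url.
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.region.batch.azure.com"
)

def add_task_for_blob(job_id: str, blob_name: str) -> None:
    """Queue one more task on an existing job in the 'active' state."""
    task = batchmodels.TaskAddParameter(
        id=f"process-{blob_name}",                      # must be unique per job
        command_line=f"/bin/bash -c 'process {blob_name}'",
    )
    client.task.add(job_id=job_id, task=task)

def seal_job(job_id: str) -> None:
    """After the final task is queued, let the job terminate on completion."""
    client.job.patch(
        job_id,
        batchmodels.JobPatchParameter(
            on_all_tasks_complete=batchmodels.OnAllTasksComplete.terminate_job
        ),
    )
```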

Celery with dynamic workers

I am putting together a Celery-based data ingestion pipeline. One thing I do not see anywhere in the documentation is how to build a flow where workers are only running when there is work to be done. (This seems like a major flaw in the design of Celery, honestly.)
I understand Celery itself won't handle autoscaling of actual servers, that's fine, but when I simulate this, Flower doesn't see the work that was submitted unless the worker was online when the task was submitted. Why? I'd love a world where I'm not paying for servers unless there is actual work to be done.
Workflow:
Imagine a while loop that's adding new data to be processed using the celery_app.send_task method.
I have custom code that sees there are N messages in the queue. It spins up a server and starts a Celery worker for that task.
Celery worker comes online, and does the work.
BUT.
Flower has no record of that task, even though I can see the broker has a "message", and while watching the output of the worker, I can see it did its thing.
If I keep the worker online, and then submit a task, it monitors everything just fine and dandy.
Anyone know why?
You can use Celery autoscaling. For example, setting autoscale to 8 means it will fire up to 8 processes to process your queue(s). It will have a master process sitting and waiting, though. You can also set a minimum, for example 2-8, which will keep 2 workers waiting but fire up more (up to 8) if needed (and then scale down when the queue is empty).
This is the process-based autoscaler. You can use it as a reference if you want to create a cloud-based autoscaler, for example one that fires up new nodes instead of just processes.
As to your Flower issue, it's hard to say without knowing your broker (Redis/RabbitMQ/etc.). Flower doesn't capture everything, as it relies on the broker doing that, and some configurations cause the broker to delete information such as which tasks have run.
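A minimal sketch tying both points together, assuming a Redis broker and Celery 5 (the app name, task, and URLs are made up): --autoscale=8,2 keeps 2 worker processes warm and grows to 8 under load, and task_send_sent_event makes the publisher emit a "sent" event so Flower can record a task even if no worker was online when it was submitted:

```python
# pipeline.py -- illustrative sketch, not a drop-in config
from celery import Celery

app = Celery(
    "pipeline",
    broker="redis://localhost:6379/0",       # assumed broker URL
    backend="redis://localhost:6379/1",
)
# Emit task-sent events from the publisher so monitors like Flower can record
# tasks that were queued while no worker was online.
app.conf.task_send_sent_event = True

@app.task
def ingest(record_id):
    ...  # process one unit of work

if __name__ == "__main__":
    # Equivalent to: celery -A pipeline worker -E --autoscale=8,2
    # (keep 2 processes, scale up to 8 while the queue has work).
    app.worker_main(["worker", "--loglevel=INFO", "-E", "--autoscale=8,2"])
```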

Monitor node.js scripts running on ubuntu instance

I have a node.js script that runs once a day on an Ubuntu EC2 instance. This script pulls data from some hundred thousand remote APIs and saves it to our local database. Is there any way we can monitor this node.js script on the remote server? There have been a few instances where the script crashed for some reason and we were unable to figure it out without SSHing into the instance and checking the logs. After the first few crashes I did, however, build a small system that sends us an email whenever the script crashes due to an uncaught exception, and also when it completes execution.
However, we need to develop a better system where we can monitor the progress of the script via the web interface of our admin application, which is deployed on another instance, and also trigger start/stop of the script via this interface. What are the possible options for achieving this?
If you'd like to stay in Node.js, there are several process-monitoring tools:
PM2 comes with lots of other features besides monitoring processes. You can monitor your processes via the CLI or their official web interface: https://keymetrics.io/. A quick search on npm also gives a bunch of nice unofficial GUI tools: https://www.npmjs.com/search?q=pm2+web
Forever is not as feature-rich as PM2, but it will do the basic process operations, and a couple of GUIs are also available on npm.
There are two problems here that you are trying to solve:
Scheduling work to be done
Monitoring a process for failure
At a simple level, this is easy: schedule a cron job and restart failed things so they keep trying.
However, when things don't go smoothly, it helps to have a lot more granularity over what you are scheduling and how it is executed. This would also give you visibility into each little piece of work.
Adding a little more complexity, you can end up with something like this:
Schedule the script that starts everything (via cron, if that's comfortable)
That script pushes the several jobs that need to be executed into a queue
A worker process (or n worker processes) consume that queue and execute pending jobs
You can monitor both the progress of the jobs, as well as the state of each worker (# of crashes, failures, jobs completed, etc.). The other tools mentioned above are good candidates for this (forever, pm2, etc.)
When jobs fail, other workers can pick up the small piece of work that was in progress and restart it. This is much more efficient than restarting the entire process, and also lets you parallelize things across n workers based on how you can split up the workloads.
You could easily throw the status onto a web app so you can check in periodically rather than have to dig through server logs.
You can also get more intelligent about different types of failures. Network error? Retry 5 times. Rate limited? Gradual back-off. Crash? Don't retry and notify via email. Etc.
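The pattern above is language-agnostic; here is a sketch of the worker side in Python purely for illustration (the queue, fetch, save_to_db, and notify helpers are all made-up stand-ins):

```python
# Illustrative worker loop: one small job at a time, per-failure retry policy.
import time
import queue

job_queue: "queue.Queue[dict]" = queue.Queue()   # stand-in for Redis/SQS/etc.

def fetch(url: str) -> dict:
    """Hypothetical helper: call one remote API and return its payload."""
    raise NotImplementedError

def save_to_db(payload: dict) -> None:
    """Hypothetical helper: persist one payload to the local database."""
    raise NotImplementedError

def notify(message: str) -> None:
    """Hypothetical helper: email/alert the admins."""
    print("ALERT:", message)

def worker() -> None:
    while True:
        job = job_queue.get()                    # blocks until a job exists
        try:
            for attempt in range(5):             # network error? retry 5 times
                try:
                    save_to_db(fetch(job["url"]))
                    break
                except ConnectionError:
                    time.sleep(2 ** attempt)     # rate limited? gradual back-off
            else:
                notify(f"job {job['id']} kept failing; giving up")
        except Exception as exc:                 # crash? don't retry, notify
            notify(f"job {job['id']} crashed: {exc}")
        finally:
            job_queue.task_done()                # per-job progress stays visible
```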
I have tried this with pm2: you can get the info of the task, then cat out or grab the log files. Or you could use a logging server; see also: https://github.com/papertrail/remote_syslog2
