Databricks is always auto-starting

I can't let my Databricks cluster go idle because something keeps auto-starting it, and I have no idea how to find out what. Here are some of the details I've found so far...
Event Log
STARTING - 2020-12-28 13:07:42 PST - Started by b161878e-ff38-4fca-924a-b5c5ce4c3193.
Running Applications
Application ID: app-20201228051029-0000
Name: Databricks Shell
Cores: 4
Memory per Executor: 7.1 GiB
Resources Per Executor: -
Submitted Time: 2020/12/28 05:10:29
User: root
State: RUNNING
Duration: 27 s
I tried killing this application but got "Method Not Allowed". There are no running jobs and no attached notebooks.
I have limited Databricks knowledge, and I don't know how to check who started it. Any help is very much appreciated.
Update: I've added event log details. This is the event logged when the cluster terminates due to inactivity:
{
  "reason": {
    "code": "INACTIVITY",
    "type": "SUCCESS",
    "parameters": {
      "inactivity_duration_min": "60"
    }
  }
}
A few seconds later, it starts again with this event:
{
  "user": "b161878e-ff38-4fca-924a-b5c5ce4c3193"
}

Related

Controlling the number of spark drivers running on Mesos

We have a cluster with 6 EC2 nodes in AWS (16 CPUs, 64 GB per node; 1 node runs the Mesos master and 5 nodes run Mesos slaves). The Mesos dispatcher runs on the master node. We use this cluster exclusively to run Spark jobs. Our Spark configuration is 5 CPUs and 10 GB per executor (same for the driver).
In one of our scheduled jobs, we have a scenario where we need to do a few hundred spark-submits at the same time. When this happens, the dispatcher starts drivers for all these spark-submits, leaving no room for any executors in the cluster.
I'm looking at a few options. I would like to get some pointers from members in the spark/mesos community.
Some options I don't want to get into are: increasing the cluster size, asking the analysts to change their job structure to combine all spark-submits into a single one, or switching to YARN/EMR (I actually tried this and ran into some messy queue problems there).
Option 1 : Using Mesos roles
I found some documentation on the use of quotas and roles to solve this, but I'm not sure about the following:
How do I create a Mesos role and update the resources made available to this role?
How do I set up separate roles for Spark drivers and Spark executors?
By default all resources are in the * role. There is a flag called spark.mesos.role that I can set when doing spark-submit, but I'm not sure how to create this role and ensure it is used only for executors.
Option 2 : Modifying the mesos cluster scheduler
When a spark-submit goes to the Mesos dispatcher, the driver request is added to a WaitingQueue. When drivers fail while executing and supervise mode is enabled, they are sent to a PendingRetryQueue with custom retry schedule settings. When resources become available from Mesos, drivers from the PendingRetryQueue are scheduled first and those in the WaitingQueue are scheduled next. I was thinking of keeping the WaitingQueue at size 5 (spark.mesos.maxDrivers) and, when there are more spark-submits than the queue size, adding those drivers to the PendingRetryQueue and scheduling them to run later. Currently, as I understand it, when there are more than 200 drivers in the WaitingQueue the Mesos REST server sends a failure message and doesn't add them to the PendingRetryQueue.
Any help on implementing either of the options would be very helpful to me. Thanks in advance.
Update: Just saw that when I do a spark-submit with a role, only the executors run in that role and the drivers run in the default * role. I think this should solve the issue for me. Once I test this, I'll post my update here and close this. Thanks.
As mentioned in the update, by default Mesos runs Spark drivers in the default role (*) and executors in the role provided by the 'spark.mesos.role' parameter. To control the resources available to each role, we can use quotas, guarantees, or reservations. We went ahead with static reservations since they suited our requirements (a rough sketch of such a reservation follows below). Thanks.
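A static reservation of this kind is typically configured on each Mesos agent when it starts; a hedged sketch, where the role names, amounts, and exact flag spelling are placeholders that depend on your Mesos version:

# Statically reserve part of this agent's resources for the 'executor' role,
# leaving the remainder unreserved (*) for drivers and other frameworks.
mesos-agent \
  --master=<master>:5050 \
  --resources='cpus(executor):11;mem(executor):49152;cpus(*):5;mem(*):10240'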
Option 1 is the right one.
First, set a dispatcher quota by creating a file like dispatcher-quota.json:
$ cat dispatcher-quota.json
{
  "role": "dispatcher",
  "guarantee": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 5.0 }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 5120.0 }
    }
  ]
}
Then push it to your Mesos master (leader) with:
curl -d @dispatcher-quota.json -X POST http://<master>:5050/quota
So now you have a quota for drivers.
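If you want, you can check that the quota was accepted; the master exposes the same endpoint for reads (a hedged sketch, adjust the host to your leader):

# Query the currently configured quotas
curl http://<master>:5050/quota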
Ensure your dispatcher is running with the right service role set; if needed, adjust it. If on DC/OS, use:
$ cat options.json
{
  "service": {
    "role": "dispatcher"
  }
}
$ dcos package install spark --options=options.json
Otherwise, feel free to share how you've deployed your dispatcher and I will provide a how-to guide.
That's it for drivers. Now let's do the same for the executors:
$ cat executor-quota.json
{
  "role": "executor",
  "guarantee": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 100.0 }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 409600.0 }
    }
  ]
}
$ curl -d @executor-quota.json -X POST http://<master>:5050/quota
Adapt the values to your requirements.
Then make sure the executors are launched with the correct role by providing:
--conf spark.mesos.role=executor \
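For example, a full submit might look roughly like this (a hedged sketch; the dispatcher URL, resource settings, class, and jar are placeholders):

spark-submit \
  --master mesos://<dispatcher-host>:7077 \
  --deploy-mode cluster \
  --conf spark.mesos.role=executor \
  --conf spark.cores.max=20 \
  --class com.example.MyJob \
  hdfs:///apps/my-job.jar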
The source of my explanation is https://github.com/mesosphere/spark-build/blob/master/docs/job-scheduling.md; don't hesitate to ask if it's not enough.
This should do the trick.

Umbraco 7.6.0 - Site becomes unresponsive for several minutes every day

We've been having a problem for several months where the site becomes completely unresponsive for 5-15 minutes every day. We have added a ton of request logging, enabled DEBUG logging, and have finally found a pattern: Approximately 2 minutes prior to the outages (in every single log file I've looked at, going back to the beginning), the following lines appear:
2017-09-26 15:13:05,652 [P7940/D9/T76] DEBUG Umbraco.Web.PublishedCache.XmlPublishedCache.XmlCacheFilePersister - Timer: release.
2017-09-26 15:13:05,652 [P7940/D9/T76] DEBUG Umbraco.Web.PublishedCache.XmlPublishedCache.XmlCacheFilePersister - Run now (sync).
From what I gather this is the process that rebuilds the umbraco.config, correct?
We have ~40,000 nodes, so I can't imagine this would be the quickest process to complete; however, the strange thing is that CPU and memory on the Azure Web App do not spike during these outages, which would seem to point to disk I/O as the bottleneck.
This raises a few questions:
Is there a way to schedule this task so that it only runs during off-peak hours?
Are there performance improvements in the newer versions (we're on 7.6.0) that might improve this functionality?
Are there any other suggestions to help correct this behavior?
Hosting environment:
Azure App Service B2 (Basic)
SQL Azure Standard (20 DTUs) - DTU usage peaks at 20%, so I don't think there's anything there. Just noting for completeness
Azure Storage for media storage
Azure CDN for media requests
Thank you so much in advance.
Update 10/4/2017
If it helps, it appears that these particular log entries correspond with the first publish of the day.
I don't feel like 40,000 nodes is too much for Umbraco, but if you want to schedule republishes, you can do this:
You can programmatically call a cache refresh using:
ApplicationContext.Current.Services.ContentService.RePublishAll();
(Umbraco source)
You could create an API controller that you can call periodically via a URL. The controller would probably look something like this:
public class CacheController : UmbracoApiController
{
    [HttpGet]
    public HttpResponseMessage Republish(string pass)
    {
        if (pass != "passcode")
        {
            return Request.CreateResponse(HttpStatusCode.Unauthorized, new
            {
                success = false,
                message = "Access denied."
            });
        }
        var result = Services.ContentService.RePublishAll();
        if (result)
        {
            return Request.CreateResponse(HttpStatusCode.OK, new
            {
                success = true,
                message = "Republished"
            });
        }
        return Request.CreateResponse(HttpStatusCode.InternalServerError, new
        {
            success = false,
            message = "An error occurred"
        });
    }
}
You could then periodically ping this URL:
/umbraco/api/cache/republish?pass=passcode
I have a blog post you can read on how to schedule events like these. I recommend just using the Windows Task Scheduler to ping the URL: https://harveywilliams.net/blog/better-task-scheduling-in-umbraco#windows-task-scheduler
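As a rough illustration (the host name, time, and passcode are placeholders, not taken from the post above), a scheduled task that pings the endpoint nightly could be created from an elevated command prompt like this:

schtasks /Create /TN "Umbraco nightly republish" /SC DAILY /ST 03:00 /TR "powershell -Command Invoke-WebRequest -UseBasicParsing 'https://example.com/umbraco/api/cache/republish?pass=passcode'"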

Azure Function Multiple Job Hosts

I am experiencing a strange problem with Azure Functions starting up multiple job hosts. The initial host seems to start up, and subsequent hosts error while trying to acquire the singleton lock. It is really noticeable when I disable one of the jobs and the error message appears that the "function runtime is unable to start". I noticed that my timer triggers were executing multiple times per their configured schedule "0 */30 * * * *", which caused me to dig deeper into this situation.
Pid 1 2017-04-25T13:30:06.680 Staging table updated successfully.
Pid 1 2017-04-25T13:30:06.680 Updating the base table from the staging table.
Pid 2 2017-04-25T13:30:06.680 Staging table updated successfully.
Pid 2 2017-04-25T13:30:06.680 Updating the base table from the staging table.
Details about the Function App:
- Azure Function running under the Dynamic/Consumption plan
- 5 functions running from class libraries (followed this guide - https://blogs.msdn.microsoft.com/appserviceteam/2017/03/16/publishing-a-net-class-library-as-a-function-app/)
- 2 functions are executed from a timer, every 30 minutes "0 */30 * * * *"
- 1 timer trigger disabled while waiting for development time
- 1 blob trigger watching a container for uploads from IoT Hub
- 1 EventHub trigger receiving events from IoT Hub (sparse events so no heavy load here)
Steps to reproduce:
- Stand up Azure function with Dynamic plan
- Create the Azure functions in the portal (ran into issues not doing this prior)
- Deploy the functions from VSTS, using WebDeploy from the guide above
- Make sure the functions tried to start
- Disable one of the functions to force a restart
- Error messages start displaying
Log pulled from the Function:
Link to log file
I have stopped the Azure Function App Service and removed the lock folder to see if that helps acquire the singleton locks, which it does, but as soon as a function is enabled/disabled or pushed from VSTS using web deploy, the errors return. I have rebuilt the Azure Function App a couple of times and the outcome is still the same.
We are in the process of trying to understand how to troubleshoot this issue so we can create a monitoring process around this scenario.
Edit
The function that executed twice is set up with the following (all of the functions look very similar to this):
function.json
{
  "scriptFile": "..\\bin\\IngestionFunctionClassLibrary.dll",
  "entryPoint": "IngestionFunctionClassLibrary.Functions.AnalyticsUpdate.Run",
  "bindings": [
    {
      "name": "myTimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 */30 * * * *"
    }
  ],
  "disabled": true
}
project.json
{
  "frameworks": {
    "net46": {
      "dependencies": {
      }
    }
  }
}
Messages that look like "Unable to acquire Singleton lock" are actually not errors, but simply informational messages. What it means is that your Function App was scaled out to multiple instances (in your case about 5). There are some lease resources that can intrinsically only be held by one instance (to support singleton behavior), so once an instance gets the lease, all others will display this message.
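If you want to see that scale-out in your own logs, here is a minimal sketch (assuming the class-library entry-point shape from your function.json; WEBSITE_INSTANCE_ID is the App Service instance identifier) that records which instance ran each execution:

using System;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Host;

namespace IngestionFunctionClassLibrary.Functions
{
    public static class AnalyticsUpdate
    {
        public static void Run(TimerInfo myTimer, TraceWriter log)
        {
            // Log which App Service instance this execution landed on,
            // making any scale-out across hosts visible in the log stream.
            var instanceId = Environment.GetEnvironmentVariable("WEBSITE_INSTANCE_ID");
            log.Info($"Timer fired on instance {instanceId} at {DateTime.UtcNow:o}");

            // ... existing work (staging table update, base table merge) goes here
        }
    }
}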

Service Fabric Application PackageDeployment Operation Time Out exception

I have a Service Fabric cluster with 3 nodes created on 3 systems, all interconnected, and I am able to connect to each of the nodes. The nodes run on Windows Server, and these Windows Server VMs are on-premises.
I am manually trying to deploy my package to the cluster (one of the nodes) and I am getting an Operation Timeout exception. I have used the commands below for the deployment.
Service Fabric PowerShell commands:
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath 'c:\sample\etc' -ApplicationPackagePathInImageStore 'abc.app.portaltype'
After executing the above command, it runs for 2-3 minutes and throws an Operation Timeout exception. My package is almost 250 MB and contains approximately 15,000 files. I then explicitly passed an extra parameter, -TimeoutSec 600 (10 minutes), to the above command; it executed successfully and the package was copied to the Service Fabric image store.
Register-ServiceFabricApplicationType -ApplicationPathInImageStore 'abc.app.portaltype'
After the Copy-ServiceFabricApplicationPackage command, I executed the Register-ServiceFabricApplicationType command above to register my application type in the cluster, but it also throws an Operation Timeout exception. I then explicitly passed the extra -TimeoutSec 600 (10 minutes) parameter to that command as well, but no luck: it throws the same operation timeout exception.
Just to check whether the operation timeout issue was caused by the number of files in the package, I created a simple, empty Service Fabric ASP.NET Core app, created a package, and tried to deploy it to the same cluster using the above commands. It deployed within a fraction of a second and works smoothly.
Does anybody have any idea how to overcome this Service Fabric operation timeout issue?
How should the operation timeout be handled when the package contains a large set of files?
Any help/suggestion would be very much appreciated.
Thanks,
If this is taking longer than the default 10 minute maximum, it's probably one of the following issues:
Large application packages (>100s of MB)
Slow network connections
A large number of files within the application package (>1000s).
The following workarounds should help you.
Add the following settings to your cluster config:
"fabricSettings": [
{
"name": "NamingService",
"parameters": [
{
"name": "MaxOperationTimeout",
"value": "3600"
},
]
}
]
Also add:
"fabricSettings": [
{
"name": "EseStore",
"parameters": [
{
"name": "MaxCursors",
"value": "32768"
},
]
}
]
There are a couple of additional features which are currently rolling out. For these to be present and functional, you need to be sure that the client is at least 2.4.28 and the runtime of your cluster is at least 5.4.157. If you're staying up to date, these should already be present in your environment.
For registration you can specify the -Async flag, which handles the provisioning asynchronously, reducing the required timeout to just the time needed to send the command rather than to process the whole application package. You can also query the status of the registration with Get-ServiceFabricApplicationType. 5.5 fixes some issues with these commands, so if they aren't working for you, you'll have to wait for that release to hit your environment.
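Putting that together, a hedged sketch of what the two steps might look like (the paths and image-store name are the ones from the question; the timeout value and the application type name from your ApplicationManifest.xml are placeholders):

# Upload the large package with a generous timeout
Copy-ServiceFabricApplicationPackage `
    -ApplicationPackagePath 'c:\sample\etc' `
    -ApplicationPackagePathInImageStore 'abc.app.portaltype' `
    -TimeoutSec 1800

# Register asynchronously so the command returns quickly...
Register-ServiceFabricApplicationType `
    -ApplicationPathInImageStore 'abc.app.portaltype' `
    -Async

# ...then poll the provisioning status
Get-ServiceFabricApplicationType -ApplicationTypeName '<YourAppType>'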

Error deploying webjob schedule - Response status code does not indicate success: 409 (Conflict)

I have a project with three scheduled webjobs. They all deploy correctly from Visual Studio, but it can't create a schedule for the third one. I get the following error:
webjobs.console.targets(110,5): Error : An error occurred while creating the WebJob schedule: Response status code does not indicate success: 409 (Conflict).
There's nothing special about my schedule in webjob-publish-settings.json:
{
  "$schema": "http://schemastore.org/schemas/json/webjob-publish-settings.json",
  "webJobName": "...",
  "startTime": "2015-12-07T00:00:00-05:00",
  "endTime": null,
  "jobRecurrenceFrequency": "Day",
  "interval": 1,
  "runMode": "Scheduled"
}
I tried adding the schedule manually from the Azure portal and got a bit more information.
Job collection 'WebJobs-EastUS' reaches maximum number of jobs allowed.
It turns out that you can only have 5 jobs per collection. This project has 3 jobs and two environments, so there are 6 in total. I created a new job schedule in a new collection, then deleted the job, and tried redeploying to see if it used the new empty collection. It did not, and I got the same error.
Next, I deleted a job in the original collection and redeployed. That time it worked fine. This isn't an ideal solution, since I'm still limited to 5 jobs when I need 6.
Is there a way to specify the job collection to use for the scheduler? Or is there something else I'm missing?
You can manage the scale of the Scheduler JobCollection used by your WebJobs in the old portal. Navigate to Scheduler/JobCollections and increase the scale on your Scheduler JobCollection to increase your job limit. This blog post shows where to find this stuff in the portal, and also details how WebJobs + Azure Scheduler work behind the scenes.
However, we highly recommend using the new inbuilt scheduling mechanism detailed in this blog post. This mechanism keeps the schedule with your job and involves no outside dependencies.
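For reference, that inbuilt mechanism is driven by a settings.job file deployed in the WebJob's folder; a hedged sketch of a daily schedule roughly equivalent to the publish settings above (the six-field CRON expression is seconds, minutes, hours, day, month, day-of-week):

{
  "schedule": "0 0 0 * * *"
}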
