Service Fabric Application Package Deployment Operation Timeout exception - Azure

I have a Service Fabric cluster with 3 nodes created on 3 separate systems, and they are interconnected; I am able to connect to each of the nodes. The nodes were created on Windows Server VMs that run on-premises.
When I manually try to deploy my package to the cluster (one of the nodes), I get an Operation Timeout exception. I used the commands below for the deployment.
Service Fabric PowerShell commands:
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath 'c:\sample\etc' -ApplicationPackagePathInImageStore 'abc.app.portaltype'
After executing the command above, it runs for 2-3 minutes and then throws an Operation Timeout exception. My package is almost 250 MB and contains approximately 15,000 files. I then explicitly passed an extra -TimeoutSec parameter of 600 (10 minutes) to the command, after which it executed successfully and the package was copied to the Service Fabric image store.
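For reference, the call that eventually succeeded looked roughly like this (same paths as above, with the explicit timeout added):
Copy-ServiceFabricApplicationPackage -ApplicationPackagePath 'c:\sample\etc' -ApplicationPackagePathInImageStore 'abc.app.portaltype' -TimeoutSec 600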
Register-ServiceFabricApplicationType -ApplicationPathInImageStore 'abc.app.portaltype'
After the Copy-ServiceFabricApplicationPackage command completed, I executed the Register-ServiceFabricApplicationType command above to register my application type in the cluster, but it also throws an Operation Timeout exception. I then explicitly passed -TimeoutSec 600 (10 minutes) to that command as well, but no luck; it throws the same Operation Timeout exception.
Just to check whether the timeout is caused by the number of files in the package, I created a simple, empty Service Fabric ASP.NET Core app, packaged it, and tried to deploy it to the same cluster using the commands above. It deployed within a fraction of a second and works smoothly.
Does anybody have any idea how to overcome this Service Fabric operation timeout issue?
How should the operation timeout be handled when the package contains a large set of files?
Any help or suggestion would be much appreciated.
Thanks,

If this is taking longer than the 10-minute default maximum, it's probably one of the following issues:
Large application packages (hundreds of MB)
Slow network connections
A large number of files within the application package (thousands)
The following workarounds should help you.
Add the following settings to your cluster config:
"fabricSettings": [
{
"name": "NamingService",
"parameters": [
{
"name": "MaxOperationTimeout",
"value": "3600"
},
]
}
]
Also add:
"fabricSettings": [
{
"name": "EseStore",
"parameters": [
{
"name": "MaxCursors",
"value": "32768"
},
]
}
]
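Since this is a standalone, on-premises cluster, one way to apply these values is a cluster configuration upgrade. This is only a sketch; it assumes your runtime supports standalone configuration upgrades, that you have merged the settings into your ClusterConfig.json, and that you have bumped its clusterConfigurationVersion:
Connect-ServiceFabricCluster -ConnectionEndpoint '<node>:19000'
Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath 'C:\sample\ClusterConfig.json'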
There are a couple of additional features currently rolling out. For these to be present and functional, the client needs to be at least 2.4.28 and the cluster runtime at least 5.4.157. If you're staying up to date, these should already be present in your environment.
For registration you can specify the -Async flag, which handles the registration asynchronously, reducing the required timeout to just the time needed to send the command rather than to process the application package. You can also query the status of the registration with Get-ServiceFabricApplicationType. 5.5 fixes some issues with these commands, so if they aren't working for you, you'll have to wait for that release to reach your environment.
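A minimal sketch of that asynchronous flow, using the image store path from the question and a placeholder application type name (the real name comes from your ApplicationManifest.xml):
Register-ServiceFabricApplicationType -ApplicationPathInImageStore 'abc.app.portaltype' -Async
Get-ServiceFabricApplicationType -ApplicationTypeName 'AbcAppPortalType'
The second call can be repeated until the type's status reports it as available.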

Related

Blob-triggered Azure Function doesn't process only one blob at a time anymore

I have written a blob-triggered function that uploads data to a Cosmos DB database using the Gremlin API, on Azure Functions version 2.0. Whenever the function is triggered, it reads the blob, extracts the relevant information, and then queries the database to upload the data.
However, when all files are uploaded to blob storage at the same time, the function processes all of them at the same time, which results in too many requests for the database to handle. To avoid this, I ensured that the function would only process one file at a time by setting batchSize to 1 in the host.json file:
{
  "extensions": {
    "queues": {
      "batchSize": 1,
      "maxDequeueCount": 1,
      "newBatchThreshold": 0
    }
  },
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    }
  },
  "version": "2.0"
}
This worked perfectly fine for 20 files at a time.
Now we are trying to process 300 files at a time, and this setting doesn't seem to work anymore: the function processes all the files at the same time again, which results in the database not being able to handle all the requests.
What am I missing here? Is there some scaling issue I'm not aware of?
From here:
If you want to avoid parallel execution for messages received on one queue, you can set batchSize to 1. However, this setting eliminates concurrency as long as your function app runs only on a single virtual machine (VM). If the function app scales out to multiple VMs, each VM could run one instance of each queue-triggered function.
You need to combine this with the app setting WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT when you run on the Consumption plan.
Or, according to the docs, the better way would be through the Function property functionAppScaleLimit: https://learn.microsoft.com/en-us/azure/azure-functions/event-driven-scaling#limit-scale-out
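For example, with the Azure CLI (resource group and function app names are placeholders), either approach looks roughly like this:
az functionapp config appsettings set -g <resource-group> -n <function-app> --settings WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT=1
az resource update --resource-type Microsoft.Web/sites -g <resource-group> -n <function-app>/config/web --set properties.functionAppScaleLimit=1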
WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT would work of course.
You can also scale to multiple function worker processes within one host: that way you have fewer hosts and a higher FUNCTIONS_WORKER_PROCESS_COUNT per host. Cost implications depend on your plan.
Note that all workers within a host share resources, so this is recommended for more IO-bound workloads.

Controlling the number of spark drivers running on Mesos

We have a cluster with 6 EC2 nodes in AWS (16 CPUs, 64 GB per node; 1 node runs the Mesos master and 5 nodes run Mesos slaves). The Mesos dispatcher runs on the master node. We use this cluster exclusively to run Spark jobs. Our Spark configuration is 5 CPUs and 10 GB per executor (the same for the driver).
In one of our scheduled jobs we have a scenario where we need to do a few hundred spark-submits at the same time. When this happens, the dispatcher starts drivers for all of these spark-submits, leaving no room for any executors in the cluster.
I'm looking at a few options and would like to get some pointers from members of the Spark/Mesos community.
Some options I don't want to get into are: increasing the cluster size, asking the analysts to change their job structure to combine all spark-submits into a single one, and switching to YARN/EMR (I actually tried this and got into some messy queue problems there).
Option 1 : Using Mesos roles
I found some documentation on the use of quotas and roles to solve this, but I'm not sure about the following:
How do I create a Mesos role and update the resources made available to this role?
How do I set up separate roles for Spark drivers and Spark executors?
By default all resources are in the * role. There is a spark.mesos.role setting that I can pass to spark-submit, but I'm not sure how to create this role and ensure it is used only for executors.
Option 2 : Modifying the mesos cluster scheduler
When a spark-submit reaches the Mesos dispatcher, it adds the driver request to a WaitingQueue. When drivers fail during execution and supervise mode is enabled, they are sent to a PendingRetryQueue with custom retry-schedule settings. When resources become available from Mesos, drivers from the PendingRetryQueue are scheduled first and those from the WaitingQueue next. I was thinking of keeping the WaitingQueue at size 5 (spark.mesos.maxDrivers) and, when there are more spark-submits than the queue size, adding those drivers to the PendingRetryQueue and scheduling them to run later. Currently, as I understand it, when there are more than 200 drivers in the WaitingQueue the Mesos REST server sends a failure message and does not add them to the PendingRetryQueue.
Any help on implementing either of the options would be very helpful to me. Thanks in advance.
Update: I just noticed that when I run spark-submit with a role, only the executors run in that role and the drivers run in the default * role. I think this should solve the issue for me. Once I test this, I'll post my update here and close this. Thanks.
As mentioned in the update, by default Mesos runs Spark drivers in the default role (*) and executors in the role provided by the spark.mesos.role parameter. To control the resources available for each role, we can use quotas, guarantees, or reservations. We went ahead with static reservations since they suited our requirements. Thanks.
Option 1 is the right one.
First, set a dispatcher quota by creating a file like dispatcher-quota.json:
$ cat dispatcher-quota.json
{
  "role": "dispatcher",
  "guarantee": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 5.0 }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 5120.0 }
    }
  ]
}
Then push it to your Mesos master (leader) with:
$ curl -d @dispatcher-quota.json -X POST http://<master>:5050/quota
Now you have a quota for the drivers.
Ensure your dispatcher is running with the right service role set; adjust it if needed. In DC/OS, use:
$ cat options.json
{
  "service": {
    "role": "dispatcher"
  }
}
$ dcos package install spark --options=options.json
Otherwise, feel free to share how you've deployed your dispatcher and I will provide a how-to guide.
That takes care of the drivers. Now let's do the same for the executors:
$ cat executor-quota.json
{
  "role": "executor",
  "guarantee": [
    {
      "name": "cpus",
      "type": "SCALAR",
      "scalar": { "value": 100.0 }
    },
    {
      "name": "mem",
      "type": "SCALAR",
      "scalar": { "value": 409600.0 }
    }
  ]
}
$ curl -d @executor-quota.json -X POST http://<master>:5050/quota
Adapt the values to your requirements.
Then make sure the executors are launched with the correct role by providing:
--conf spark.mesos.role=executor \
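Putting it together, a spark-submit against the dispatcher could look roughly like this (the dispatcher host, class, and jar URL are placeholders):
spark-submit \
  --master mesos://<dispatcher-host>:7077 \
  --deploy-mode cluster \
  --conf spark.mesos.role=executor \
  --class com.example.MyJob \
  http://<artifact-host>/my-job.jar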
The source of my explanation is https://github.com/mesosphere/spark-build/blob/master/docs/job-scheduling.md; don't hesitate to ask if that's not enough.
This should do the trick.

Azure Service Fabric Changing Setting After Deployment

We're testing out Service Fabric and experiencing some issues.
Firstly, the A1v2 VM size comes with 10 GB of HDD, but the shared log file takes up over 9 GB of space almost immediately, so deployments then fail with an out-of-disk-space exception. I've discovered that I need to set SharedLogSizeInMB to a smaller value, as below:
"fabricSettings": [
{
"name": "KtlLogger",
"parameters": [
{
"name": "SharedLogSizeInMB",
"value": "256"
}
]
}
],
The issue now is that I am not sure how to apply this change. I can't seem to find a way to do it in the portal, and when I download the PowerShell setup scripts from the portal during the Service Fabric setup process and run them to create a new Service Fabric instance, it only gets as far as deploying and then fails.
So my questions are:
1) How should I be adjusting this setting? I assume it can only be done via an ARM template?
2) Can these scripts only be used to create a new Service Fabric cluster, or can you also somehow run them to just change a setting?
3) Should the vanilla script I export from Azure just work? The error messages I can find are very generic and unexplanatory; there just seems to be an exception thrown in each VM when trying to create Service Fabric. I am pretty much using all the default settings, nothing special.
Thanks,
Oliver
EDIT
The files mentioned in comment 2 just contain this over and over:
2017/05/02-04:22:00.388,Info,4864,ImageStoreClient.ManagedFileLock,Obtained writer lock for D:\SvcFab\lock
2017/05/02-04:22:00.739,Info,4864,FabricDeployer.FabricDeployer,Executing Configure /fabricBinRoot:C:\Program Files\Microsoft Service Fabric\bin /fabricDataRoot:D:\SvcFab /fabricLogRoot:D:\SvcFab\Log /cm:C:\WindowsAzure\Logs\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode\1.0.0.35\TempClusterManifest.xml /oldClusterManifestString: /im:C:\WindowsAzure\Logs\Plugins\Microsoft.Azure.ServiceFabric.ServiceFabricNode\1.0.0.35\InfrastructureManifest.xml /instanceId: /targetVersion: /nodeName: /nodeTypeName: /runAsType: /runAsAccountName: /runAsPassword: /serviceStartupType: /output: /currentVersion: /error: /bootstrapMSIPath: /machineName: /fabricPackageRoot: /jsonClusterConfigLocation: /enableCircularTraceSession:False
2017/05/02-04:22:02.452,Info,4864,FabricDeployer.FabricDeployer,Running operation System.Fabric.FabricDeployer.ConfigureOperation
2017/05/02-04:22:02.576,Info,4864,FabricDeployer.FabricDeployer,Creating FabricDataRoot D:\SvcFab, if it doesn't exist on machine
2017/05/02-04:22:02.576,Info,4864,FabricDeployer.FabricDeployer,Creating FabricLogRoot D:\SvcFab\Log, if it doesn't exist on machine
2017/05/02-04:22:04.907,Info,4864,ImageStoreClient.ManagedFileLock,Released writer lock on D:\SvcFab\lock
My event logs show this over and over:
Failed starting service, Error: Microsoft.Azure.ServiceFabric.Extension.Core.AgentException: Configure node failed with code -1
at Microsoft.Azure.ServiceFabric.Extension.Core.NodeBootstrapAgent.StartFabricHostService(Boolean isBootstrapping)
ERROR: Microsoft.Azure.ServiceFabric.Extension.Core.AgentException: Configure node failed with code -1
at Microsoft.Azure.ServiceFabric.Extension.Core.NodeBootstrapAgent.StartFabricHostService(Boolean isBootstrapping)
at Microsoft.Azure.ServiceFabric.Extension.Core.NodeBootstrapAgent.d__11.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.Azure.ServiceFabric.Extension.Core.NodeBootstrapAgent.d__0.MoveNext()

Azure Function Multiple Job Hosts

I am experiencing a strange problem with Azure Functions: it is starting up multiple job hosts. The initial host seems to start up, and subsequent hosts error while trying to acquire the singleton lock. It is really noticeable when I disable one of the jobs and the error message appears that the "function runtime is unable to start". I noticed that my timer triggers were executing multiple times per their configured schedule "0 */30 * * * *", which caused me to dig deeper into this situation.
Pid 1 2017-04-25T13:30:06.680 Staging table updated successfully.
Pid 1 2017-04-25T13:30:06.680 Updating the base table from the staging table.
Pid 2 2017-04-25T13:30:06.680 Staging table updated successfully.
Pid 2 2017-04-25T13:30:06.680 Updating the base table from the staging table.
Details about the Function App:
- Azure Function running under the Dynamic/Consumption plan
- 5 functions running from class libraries (followed this guide - https://blogs.msdn.microsoft.com/appserviceteam/2017/03/16/publishing-a-net-class-library-as-a-function-app/)
- 2 functions are executed from a timer, every 30 minutes "0 */30 * * * *"
- 1 timer trigger disabled while waiting for development time
- 1 blob trigger watching a container for uploads from IoT Hub
- 1 EventHub trigger receiving events from IoT Hub (sparse events so no heavy load here)
Steps to reproduce:
- Stand up Azure function with Dynamic plan
- Create the Azure Functions in the portal (I ran into issues when not doing this first)
- Deploy the functions from VSTS, using WebDeploy from the guide above
- Make sure the functions tried to start
- Disable one of the functions to force a restart
- Error messages start appearing
Log pulled from the Function:
Link to log file
I have stopped the Azure Function App Service and removed the lock folder to see if that helps acquire singleton locks, which it does, but as soon as a function is enabled/disabled or pushed from VSTS using Web Deploy, the errors return. I have rebuilt the Azure Function a couple of times and the outcome is still the same.
We are in the process of trying to understand how to troubleshoot this issue so we can create a monitoring process around this scenario.
Edit
The function that executed twice is set up with the following (all of the functions look very similar to this):
function.json
{
  "scriptFile": "..\\bin\\IngestionFunctionClassLibrary.dll",
  "entryPoint": "IngestionFunctionClassLibrary.Functions.AnalyticsUpdate.Run",
  "bindings": [
    {
      "name": "myTimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 */30 * * * *"
    }
  ],
  "disabled": true
}
project.json
{
  "frameworks": {
    "net46": {
      "dependencies": {
      }
    }
  }
}
Messages that look like "Unable to acquire Singleton lock" are actually not errors, but simply informational messages. They mean that your Function App was scaled out to multiple instances (in your case about 5). There are some lease resources that can intrinsically only be held by one instance (to support singleton behavior), so once one instance gets the lease, all the others will display this message.

How to run a job through Queue in arangodb

I am moving from ArangoDB 2.5.7 to ArangoDB 3.1.7. I have managed to make everything work except the jobs. I've looked at the documentation and I don't understand whether I have to create a separate service just for that.
So, I have a Foxx application myApp:
manifest.json
{
  "name": "myApp",
  "version": "0.0.1",
  "author": "Deepak",
  "files": {
    "/static": "static"
  },
  "engines": {
    "arangodb": "^3.1.7"
  },
  "scripts": {
    "setup": "./scripts/setup.js",
    "myJob": "./scripts/myJob.js"
  },
  "main": "index.js"
}
index.js
'use strict';
module.context.use('/one', require('./app'));
app.js
const createRouter = require('org/arangodb/foxx/router');
const controller = createRouter();
module.exports = controller;
const queues = require('@arangodb/foxx/queues');
const queue = queues.create('myQueue', 2);
queue.push({mount:"/myJob", name:"myJob"}, {"a":4}, {"allowUnknown": true});
myJob.js
const argv = module.context.argv;
var obj = argv[0];
console.log('obj:'+obj);
I get the following error:
Job failed:
ArangoError: service not found
Mount path: "/myJob".
I am not sure whether I have to extract myJob into an external service. Can you help me? I don't see a complete example of how to do it.
To answer your question:
You do not have to extract the job script into a new service. You can specify the mount point of the current service by using module.context.mount, as shown in the sketch below.
You can find more information about the context object in the documentation: https://docs.arangodb.com/3.1/Manual/Foxx/Context.html
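For example, the push from the question's app.js could use the service's own mount point instead of the hard-coded "/myJob" (a minimal sketch):
const queues = require('@arangodb/foxx/queues');
const queue = queues.create('myQueue', 2);
// module.context.mount resolves to wherever this service is mounted,
// so the queue worker can locate the 'myJob' script declared in manifest.json
queue.push({mount: module.context.mount, name: 'myJob'}, {a: 4}, {allowUnknown: true});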
By the way, it's probably not a good idea to arbitrarily create jobs at mount-time. The common use case for the queue is to create jobs in route handlers as a side-effect of incoming requests (e.g. to dispatch a welcome e-mail on signup).
If you create a job at mount time (e.g. in your main file or a file required by it), the job will be created whenever the file is executed, which will be at least once for each Foxx thread (by default ArangoDB uses multiple Foxx threads to handle parallel requests), or, when development mode is enabled, once per request(!).
Likewise if you create a job in your setup script it will be created whenever the setup script is executed, although this will only happen in one thread each time (but still once per request when development mode is active).
If you need e.g. a periodic job that lives alongside your service, you should put it in a unique queue and only create it in your setup script after checking whether it already exists.
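A sketch of what that check could look like in scripts/setup.js (the queue name and the repeat options are illustrative, not prescribed):
// scripts/setup.js: create the periodic job only if it does not exist yet
const queues = require('@arangodb/foxx/queues');
const queue = queues.create('myApp-periodic', 1);
if (queue.all().length === 0) {
  queue.push(
    {mount: module.context.mount, name: 'myJob'},
    {},
    {repeatTimes: Infinity, repeatDelay: 60 * 1000} // re-run roughly every minute
  );
}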
On the changes in the queue API:
The queue API changed in 2.6 due to a serious issue with the old API that would frequently result in pending jobs not being properly rescheduled when the ArangoDB daemon was restarted after a job had been pushed to the queue.
Specifically ArangoDB 2.6 introduced so-called script-based (rather than function-based) job types: https://docs.arangodb.com/3.1/Manual/ReleaseNotes/UpgradingChanges26.html#foxx-queues
Support for the old function-based job types was dropped in ArangoDB 2.7 and the cookbook recipe was updated to reflect script-based job types: https://docs.arangodb.com/2.8/cookbook/FoxxQueues.html
A more detailed description of the new queue can be found in the documentation: https://docs.arangodb.com/3.1/Manual/Foxx/Scripts.html
