How does PBS/Torque/Maui choose a node?

We know that all the node features are stored in the server_priv/nodes file. Every time we use:
qsub -l nodes=1:linux
or
#PBS -l nodes=1:linux
to submit jobs, there may be hundreds of machines with the linux feature. I wonder how Torque selects the right node?
Does it search the server_priv/nodes file from top to bottom?
Alphabetically?
Does it depend on the machine workload?
Any help is greatly appreciated!

In this case, Maui is choosing the nodes to allocate to the job. Maui is the scheduler and therefore the decision-maker. I believe the default policy is FIRSTAVAILABLE, which I think allocates the first available node in the order the nodes are listed in the nodes file (located in PBS_HOME/server_priv/nodes).
However, I don't know which node allocation policy your site is using. If you have access, check the Maui config file for NODEALLOCATIONPOLICY to see which one you are using. If you don't have access, you'll need to contact an administrator. To better understand the different options for node allocation, you can check out the Maui docs.
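For example, a quick way to check (assuming Maui's config lives at /usr/local/maui/maui.cfg; the path varies by site):
grep NODEALLOCATIONPOLICY /usr/local/maui/maui.cfg
# e.g. NODEALLOCATIONPOLICY  FIRSTAVAILABLE
Other values documented in the Maui admin guide include LASTAVAILABLE, MINRESOURCE, CPULOAD and PRIORITY; editing that line and restarting Maui changes how nodes are picked.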

Related

Elasticsearch: applying security on a running cluster

I have an ELK stack 7.6.2 with Logstash, an Elasticsearch cluster with 3 nodes, and Kibana. I would like to add security, but the only docs I can find always start 'from scratch'. I would like an example for an already running cluster, so as not to mess it up. Thanks for your help.
Guillaume
You cannot enable security features on an already running cluster. Security settings are classified as static, meaning that they cannot be dynamically updated on the fly:
static:
These settings must be set at the node level, either in the elasticsearch.yml file, or as an environment variable or on the command line when starting a node. They must be set on every relevant node in the cluster.
dynamic:
These settings can be dynamically updated on a live cluster with the cluster-update-settings API.
See https://www.elastic.co/guide/en/elasticsearch/reference/7.6/modules.html for reference and for all settings that can be dynamically updated (you won't find security settings there).
Also, from this guide (https://www.elastic.co/guide/en/elasticsearch/reference/current/get-started-enable-security.html) one can tell that you need to stop your running elasticsearch and kibana instances in order to enable security.
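For illustration, a minimal sketch of the per-node change (setting names taken from the 7.6 security docs; the certificate file and how you generate it are assumptions you must adapt to your cluster): stop the node, add something like the following to its elasticsearch.yml, restart it, and repeat on every node.
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
Afterwards you would set the built-in user passwords with bin/elasticsearch-setup-passwords and give Kibana the credentials (elasticsearch.username / elasticsearch.password in kibana.yml).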
I hope I could help you.

Use Microsoft Azure as a computing cluster

My lab just got a sponsorship from Microsoft Azure and I'm exploring how to use it. I'm new to industrial-scale cloud services and pretty confused by the mass of terminology and concepts. In short, here is my scenario:
I want to run the same algorithm on multiple datasets, i.e. data parallelism.
The algorithm is implemented in C++ on Linux (Ubuntu 16.04). I did my best to use static linking, but it still depends on some dynamic libraries. However, these dynamic libraries can easily be installed via apt.
Each dataset is structured, meaning the data (images, other files...) are organized in folders.
The ideal system configuration would be a bunch of identical VMs and a shared file system. Then I could submit my jobs with 'qsub' from a script or something. Is there a way to do this on Azure?
I investigated the Batch service, but had trouble installing dependencies after creating the compute nodes. I also had trouble with storage. So far I have only seen examples of using Batch with Blob storage, which is unstructured.
So are there any other services in Azure that can meet my requirements?
I somehow figured it out myself based on this article: https://azure.microsoft.com/en-us/documentation/articles/virtual-machines-linux-classic-hpcpack-cluster/. Here is my solution:
Create an HPC Pack cluster with a Windows head node and a set of Linux compute nodes. There are several useful templates in the Marketplace.
From the head node, we can execute commands on the Linux compute nodes, either inside HPC Cluster Manager or using "clusrun" from PowerShell. We can easily install dependencies on the compute nodes via apt-get.
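For example, something like the following from the head node (the node group name "LinuxNodes" and the package are placeholders, the commands are assumed to run with root privileges on the nodes, and the exact clusrun flags are worth double-checking with clusrun /? on your own installation):
clusrun /nodegroup:LinuxNodes apt-get update
clusrun /nodegroup:LinuxNodes apt-get install -y libboost-all-dev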
Create a File share inside one of the storage accounts. This can be mounted by all machines inside the cluster.
One glitch here is that, for some encryption-related reason, you cannot mount the File share on Linux machines outside Azure. Two solutions came to mind: (1) mount the File share on the Windows head node and share files from there, either by FTP or SSH; (2) create another Linux VM (as a bridge), mount the File share on that VM and use "scp" to move data to and from it from outside. Since I'm not familiar with Windows, I adopted the latter solution.
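For reference, mounting the share from a Linux VM inside Azure looks roughly like this (storage account name, share name, mount point and key are placeholders; the cifs-utils package must be installed first):
sudo mount -t cifs //mystorageacct.file.core.windows.net/myshare /mnt/data -o vers=3.0,username=mystorageacct,password=<storage-account-key>,dir_mode=0777,file_mode=0777
and copying data in from outside through the bridge VM is a plain scp:
scp -r ./dataset01 azureuser@<bridge-vm-ip>:/mnt/data/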
For the executable, I simply uploaded the binary compiled on my local machine. Most dependencies are statically linked. There are still a few dynamic objects, though. I uploaded these dynamic objects to Azure and set LD_LIBRARY_PATH when executing programs on the compute nodes.
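Something along these lines, where the paths are placeholders for wherever the share is mounted:
export LD_LIBRARY_PATH=/mnt/data/libs:$LD_LIBRARY_PATH
/mnt/data/bin/my_algorithm /mnt/data/dataset01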
Job submission is done on the Windows head node. To make it more flexible, I wrote a Python script that writes XML files. The Job Manager can load these XML files to create jobs. Here are some instructions: https://msdn.microsoft.com/en-us/library/hh560266(v=vs.85).aspx
I believe there should be a more elegant solution with the Azure Batch service, but so far my small cluster runs pretty well with HPC Pack. I hope this post can help somebody.
Azure Files could provide you with a shared file solution for your Ubuntu boxes; details are here:
https://azure.microsoft.com/en-us/documentation/articles/storage-how-to-use-files-linux/
Again, depending on your requirements, you can create a pseudo folder structure in Blob storage by using containers and then "/" separators in your blob naming strategy.
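For instance, blob names like the following (hypothetical) behave like a folder tree when you list them with a "/" delimiter:
datasets/dataset01/images/img_0001.png
datasets/dataset01/labels.csv
datasets/dataset02/images/img_0001.png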
To David's point, whilst Batch is generally looked at for these kinds of workloads, it may not fit your solution. VM Scale Sets (https://azure.microsoft.com/en-us/documentation/articles/virtual-machine-scale-sets-overview/) would allow you to scale your compute capacity either by load or by schedule, depending on your workload behavior.

what is "spark.history.retainedApplications" points to

As per the Apache doc "http://spark.apache.org/docs/latest/monitoring.html",
spark.history.retainedApplications is described as "The number of application UIs to retain. If this cap is exceeded, then the oldest applications will be removed".
But I see more than the configured number of apps in the UI. Is that correct, or does it only keep that many apps in memory and load the others back in when needed? Please clarify. Thanks.
That setting specifically applies to the history server. If you don't have one started (it's typically used with YARN and Mesos I believe), then the setting you're after is spark.ui.retainedJobs. Check the Spark UI configuration parameters for more details.
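For example, in conf/spark-defaults.conf (the value 500 is just an illustration):
spark.ui.retainedJobs    500
spark.ui.retainedStages  500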
These settings only apply to jobs, so in order to pass them to the master itself, check the spark.deploy options in the stand-alone deployment section. You can set them via the SPARK_MASTER_OPTS environment variable.
If you want to clean the data files produced by workers, check the spark.worker.cleanup options in the same section. You can set them via the SPARK_WORKER_OPTS environment variable on your workers.
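A rough sketch of both in conf/spark-env.sh, with illustrative values:
export SPARK_MASTER_OPTS="-Dspark.deploy.retainedApplications=50 -Dspark.deploy.retainedDrivers=50"
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=604800"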

Setting up Puppet in the first place

I am trying to understand the best practice for setting up Puppet in the first place; let's say I have 1000 existing servers that need to be managed by Puppet.
Do I manually install the Puppet agent on each, or is there a better way?
Sorry if this question is too generic; I just want to get some idea.
1000 servers could be a lot for a single master instance. Of course, it will depend on the master's specs and other factors related to the Puppet runs.
There are a few questions you need to answer first to determine how you are going to go about it, such as:
Puppet Enterprise or Open Source?
What is the current configuration nightmare you are trying to solve?
What is the current configuration data related to the challenge or problem you have?
What are the current business roles (e.g. web server, load balancer, database, etc.) related to the problem you have? What makes a role in terms of configuration?
I would suggest that you start small first, to learn more about the Puppet DSL and its ecosystem (master, agent, PuppetDB, console/dashboard). I also recommend you start with the free 10-node Puppet Enterprise, as it will let you focus more on the problem at hand rather than on how to configure the Puppet masters and agents, how to scale them, etc.
One more thing: install the Puppet agent everywhere if you can, in noop/disabled mode, to at least get facts, and run it in a masterless fashion with puppet apply when you need to. I find noop mode more useful as it tells you what needs to be changed; you can also enforce changes using --no-noop.
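For example (the manifest path is a placeholder):
puppet apply --noop /etc/puppet/manifests/site.pp    # report what would change, change nothing
puppet apply /etc/puppet/manifests/site.pp           # actually enforce the manifest
puppet agent -t --noop                               # same dry-run idea against a master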
Hope that will get you started.
To answer your question: yes, the Puppet agent needs to be installed on every node. If you are managing 1000 nodes, I would assume you have your own OS image. In that case, it's best to add the agent to the OS image and use this image on the 1000 nodes.
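As a hedged sketch for a Debian/Ubuntu image (package and repository names differ between the distro's own "puppet" package and the newer "puppet-agent" packages from Puppet's repositories, and puppet.example.com is a hypothetical master name):
sudo apt-get update
sudo apt-get install -y puppet
printf '[agent]\nserver = puppet.example.com\n' | sudo tee -a /etc/puppet/puppet.conf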

How to raise or lower the log level in puppet master?

I am using Puppet 3.2.3, Passenger and Apache on CentOS 6. I have 680 compute nodes in a cluster, along with 8 gateways that users use to log in to the cluster and submit jobs. All the nodes and gateways are under Puppet control. I recently upgraded from 2.6. The master logs to syslog as desired, but how to change the log level for the master escapes me. I appear to have the choice of --debug or nothing. Debug logs far too much detail, while not using that switch simply logs each time Passenger/Apache launches a new worker to handle incoming connections.
I find nothing in the online docs about doing this. What I want is to log each time a node hits the server, but I do not need to see the compiled catalogue or resources in /var/log/messages.
How is this accomplished?
This is a hack, but here is how I solved the problem. In the file (config.ru) that Passenger uses to launch Puppet via Rack middleware, which on my system lives in /usr/share/puppet/rack/puppetmasterd, I noticed these lines:
require 'puppet/util/command_line'
run Puppet::Util::CommandLine.new.execute
So I edited this to become:
require 'puppet/util/command_line'
Puppet::Util::Log.level = :info
run Puppet::Util::CommandLine.new.execute
I suppose other choices for Log.level could be :warn and others.
