Puppet splay & splaylimit explained?

I'm looking for someone to explain the usage of splay & splaylimit within Puppet configuration.
The documentation on the Puppet site itself is limited to say the least.
I am suffering from a thundering herd on my master, i.e. a number of agents hammering the master for their catalogs all at once, to the point where the master falls over and each agent reports a timeout error.
I know I need to use the splay & splaylimit options in my config to stop all agents checking in at once, but I'm unsure how to implement them.
Can anyone assist please?

The splay and splaylimit settings work together with the runinterval setting to help spread out agents' catalog requests in time. They are useful primarily in situations where many machines' agents may be started at once, such as when a bunch of VMs all start up together under control of the same host.
Ordinarily, the agent, when running in daemon mode, starts a catalog run when it first starts up, and again at runinterval intervals. If the splay option is set to true, then the agent instead generates a (pseudo-)random delay, not exceeding splaylimit, and delays the start of each catalog run by that amount of time, relative to when it would have started if splaying were disabled.
Thus, if you have a thundering herd problem arising from many agents being started at about the same time, then you could try to address it by setting
splay = true
in your agents' configurations. If you don't configure a specific splaylimit then it defaults to your runinterval, resulting in the catalog runs of all the agents started at the same time being spread more or less uniformly over the whole interval, and therefore over all time going forward.
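For concreteness, here is a minimal puppet.conf sketch (the interval and limit values are illustrative, not recommendations):

[agent]
# Check in every 30 minutes, but delay each run by a
# (pseudo-)random offset of up to 15 minutes.
runinterval = 30m
splay = true
splaylimit = 15m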
On the other hand, if your agents' startup is not somehow orchestrated so as to cause them to bunch up, then splaying doesn't really do anything for you. That is, if agent startups are approximately random anyway then it doesn't help you to shift their catalog request cycles.
I think splay can also help when you run the agent in --onetime mode via an external scheduler (e.g. cron). That would present a good use case for the splaylimit setting, because in that case the configured runinterval has nothing to do with when or how frequently the agent runs.
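For that cron-driven case, a hedged sketch of a crontab entry (the binary path and the 5-minute limit are illustrative; --splay and --splaylimit can be passed on the command line like any other setting):

# attempt a catalog run every 30 minutes, sleeping a random 0-5 minutes first
*/30 * * * * /opt/puppetlabs/bin/puppet agent --onetime --no-daemonize --splay --splaylimit 5m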

Related

Preferred way to schedule a job: @Scheduled vs crontab

I have to run one utility periodically, say every minute.
So I have two options: Spring Boot's @Scheduled vs. crontab on the Linux box that we are using to deploy the artifact.
So, my question is: which way should I use?
What are the pros and cons of each solution? Is there any other solution you can suggest?
Just for comparing these two, I don't have many points, only what follows from the situation I am facing now. I just built a new endpoint and am doing performance and stress testing for it on production. I am yet to decide the cron schedule times, and those may need slight tweaking over some more time of observation. Setting the schedule via @Scheduled requires me to redeploy/restart the application every time I make a change.
An application restart generally takes more time than a crontab edit.
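To make the @Scheduled side concrete, a minimal Spring sketch (the class, method, and property names are illustrative, and it assumes @EnableScheduling is present on a configuration class). Reading the cron expression from a property at least turns a redeploy into a config change plus restart:

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class UtilityRunner {

    // Spring cron expressions have six fields; "0 * * * * *" fires every minute.
    // The expression is read from the hypothetical utility.cron property,
    // with that value as the default.
    @Scheduled(cron = "${utility.cron:0 * * * * *}")
    public void runUtility() {
        // hit the endpoint / do the periodic work here
    }
}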
Other than this, a few points on availability and scalability:
Setting it only via crontab on a single server would mean a single point of failure if the server goes down.
Setting it via @Scheduled could also mean the same.
If you have multiple instances of the server, it could mean the endpoint getting triggered twice, which you may not want. The worst case: you wrote the @Scheduled endpoint long ago, while the application was deployed on a single server, and then forgot about it; when scaling up happens much later, the process suddenly starts getting hit twice.
So, none of these seem ideal in terms of availability and scalability.
In such situations, ideally a distributed cron management system (I have heard about Rundeck) is needed: it decides which of the available servers should be called to hit the desired endpoint, and can call the next server if the first one is down.
If any investigation is needed, Rundeck's logs can be checked to find the server that was actually called.
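Short of a full distributed scheduler, one common stopgap when the instances share a database is to guard the scheduled method with a try-lock, so only one instance does the work on each tick. A hedged sketch, assuming PostgreSQL (pg_try_advisory_lock is Postgres-specific, and the lock key 42 is arbitrary):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.DataSource;

public class SchedulerGuard {

    // Returns true if this instance won the lock and ran the work.
    public boolean runIfLockHolder(DataSource ds, Runnable work) throws Exception {
        try (Connection c = ds.getConnection();
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT pg_try_advisory_lock(42)")) {
            rs.next();
            if (!rs.getBoolean(1)) {
                return false; // another instance holds the lock; skip this tick
            }
            try {
                work.run();
                return true;
            } finally {
                // must be released on the same connection/session that acquired it
                try (Statement unlock = c.createStatement()) {
                    unlock.execute("SELECT pg_advisory_unlock(42)");
                }
            }
        }
    }
}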

Liferay: timeout on the regeneration of the search indexes

When launching the "Rebuild all search indexes" function, the request very often times out, perhaps because the browser waits too long for the answer.
How do I fix this? As it is, I cannot tell when the index regeneration ends, nor whether it succeeded.
Liferay 6.2
If your data set is pretty big, regenerating the indexes can take a lot of time. There's no 'fix' for this. You could, for example, use a different indexer, such as Solr, to remove the burden from the machine running Liferay.
I can always tell when it is done by running it late at night during low-traffic hours and monitoring JMX (CPU) activity and the (Tomcat) logs. Both give indications of the indexer completing various tasks and starting new ones, but I find the JMX monitoring the clearest. In our case there is around 500+ MB of index data on each node, and it takes roughly 2.5 hours, give or take. I also kick off indexing on each application node, since we have found the "cluster link" software unreliable for copying the index across cluster nodes...

Is there any performance overhead when 'RunInterval' on the Puppet agent is set to 0 (continuously run)?

By default, a Puppet agent polls the Puppet master every 30 minutes for configuration changes. So there is always a lag of up to 30 minutes between a configuration change on the master and its application on the relevant agents.
I want the changes to be applied to agents in near real time (approximately in less than a minute). For that, I want to set 'RunInterval' to 0 on agent, so that the changes are applied in near real time.
I want to understand if there is any performance overhead associated with setting 'RunInterval' to 0 (continuous running). How does the agent function when it is set to run continuously? Does it use some sort of long polling? Is it recommended/advisable to override the default and set 'RunInterval' to 0?
Yes, there is a good amount of overhead.
There is overhead at the master, which must handle many more requests per unit time -- perhaps as many as 200 times as many requests, depending on how long catalog runs take at the agents. For each request, it must sync plugins with the agent, compile and return a catalog, and possibly serve files, none of which are trivial.
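To see where a figure like 200x can come from (illustrative numbers, not a measurement): with the default runinterval of 30 minutes (1800 seconds), an agent that instead loops continuously with, say, a 9-second catalog cycle would issue 1800 / 9 = 200 requests in the window where it previously issued one.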
There is also overhead at the agent. For each catalog run, it must at minimum go through each declared resource and test whether that resource is in the specified target state. Doing so is non-trivial even when no changes are required.
Your strategy is more likely to fall over because of the greatly-increased demands it will place on your master than because of the extra load on the clients, but your clients will definitely feel it if they're already carrying a heavy load.
If you want the ability to occasionally trigger specific servers to sync immediately, then consider looking into mcollective.
If you want the ability to routinely trigger many servers to sync immediately, then consider switching to masterless mode, combined with mcollective or some other kind of group remote-control software.
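For the occasional-trigger case, a hedged example assuming MCollective with its puppet agent plugin installed (the exact command shape varies by version, and the hostname is made up); -I filters by node identity:

mco puppet runonce -I web01.example.com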
The puppet master will compile your Puppet manifests into a single catalog for each client to use. If all clients query the puppet master at the same time, it will likely struggle under the load, assuming you have a non-trivial Puppet deployment with several hundred or thousand Puppet-managed servers.
What should matter to you is eventual consistency - the certainty that given enough time, the clients will converge to the desired configuration.

Simultaneous Lotus Notes server-side agents

In my Lotus Notes workflow application, I have a scheduled server agent (every five minutes). When users act on a document, a server-side agent is also triggered (this agent modifies the said document, server-side). In production, we are receiving many complaints that processing is incomplete or sometimes does not happen at all. I checked the server configuration and found out that only 4 agents can run concurrently. Being a global application with over 50,000 users, the only thing I can blame for these issues is the volume of agent runs, but I'm not sure if I'm correct (I'm a developer and lack knowledge about this area). Can someone help me determine whether my reasoning about simultaneous agents is correct, and help me understand how to solve this? Can you provide references, please? Thank you in advance!
Important thing to remember.
Scheduled agents on the server will only run one agent from the same database at any given time!
So if you have Database A with agent X (every 5 minutes) and agent Y (every 10 minutes), it will first run X. Once X completes, whichever is scheduled next (X or Y) will run. It will never run X and Y at the same time if they are in the same database.
This is intended behaviour, to prevent possible deadlocks between agents in the same database.
You also have a schedule queue, which has a limit on the number of agents that can be queued. For example, if agent X runs every 5 minutes but takes 10 minutes to complete, your schedule queue will slowly fill up and eventually run out of space.
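To make that concrete on the answer's own model: if X is scheduled every 5 minutes but each run takes 10 minutes, two new queue entries arrive for every one that completes, so the backlog grows by roughly one entry every 10 minutes until the queue hits its limit.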
So how to work around this? There are a couple of ways.
Option 1: Use Program Documents on the server.
Set the agent's schedule to "Never" and have a program document execute the agent with the command:
tell amgr run "dir/database.nsf" 'agentName'
PRO:
You will be able to run agents on a <5 minute schedule.
You can run multiple agents in the same database.
CON:
You have to be aware of what the agent interacts with, and code it to handle other agents (or another instance of itself) running at the same time.
There can be serious performance implications in doing this. You need to be aware of what is going on in the server and how it would impact it.
If you have lots of databases, you end up with a messy program document list that is hard to maintain.
Agents run via "Tell AMGR" will not be terminated if they exceed the agent execution time allowed on the server. They have to be killed manually.
There is no easy way to determine which agents are running or have run.
Option 2: Create an agent which calls out to web agents.
PRO:
You will be able to run agents on a <5 minute schedule.
You can run multiple agents in the same database.
You have slightly better control of what runs via another agent.
CON:
You need HTTP running on the server.
There are performance implications in doing this and again you need to be aware of how it will interact with the system if multiple instances run or other agents.
Agents will not be terminated if they exceed the agent execution time allowed on the server.
You will need to allow concurrent web agents/web services on the server or you can potentially hang the server.
Option 3: Change from scheduled to another trigger.
For example "When new mail arrives". Overall this is the best option of the three.
...
In closing, I would say that you should rarely use "Execute every 5 mins" if you can avoid it, unless it is a critical agent that isn't going to be executed by multiple users across different databases.

Resource-manage external nodes in Jenkins for tests

My problem is that I have code that needs a rebooted node. I have many long-running Jenkins test jobs that need to be executed on rebooted nodes.
My existing solution is to define multiple "proxy" machines in Jenkins with the same label (TestLable) and 1 executor per machine. I bind all the test jobs to the label (TestLable). In the test execution script I detect the Jenkins machine (Jenkins env. NODE_NAME) and use that to know which physical machine the tests should use.
Does anybody know of a better solution?
The above works, but I need to define a high number of "nodes/machines" that may not be needed. What I would like is a plugin that could grant a token to a Jenkins job. This way a job would not be executed until both a Jenkins executor and a token were free. The token should be a string, so that my test jobs could use it to know which external node to use.
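A minimal Java sketch of that token idea (the class and the host names are hypothetical; this is not an existing plugin, just the shape of the mechanism):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A pool of string tokens, each naming a physical test machine.
// A job blocks in acquire() until a machine token is free, and
// returns the token when the test run is finished.
public class TokenPool {
    private final BlockingQueue<String> tokens = new LinkedBlockingQueue<>();

    public TokenPool(String... machines) {
        for (String m : machines) {
            tokens.add(m);
        }
    }

    public String acquire() throws InterruptedException {
        return tokens.take(); // blocks until a token is available
    }

    public void release(String token) {
        tokens.add(token);
    }
}

// usage: TokenPool pool = new TokenPool("host-a", "host-b");
//        String machine = pool.acquire(); ... pool.release(machine);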
We have written our own scheduler that allocates stuff before starting Jenkins nodes. There may be a better solution - but this works for us mostly. I've yet to come across an off-the-shelf scheduler that can deal with complicated allocation of different hardware resources. We have n box types, allocated to n build types.
Some build types we have are not compatible together without destroying all persistent data - which may be required as it takes a long time to gather. Some jobs require combinations of these hardware types. We store the details in a DB, and then use business logic to determine how it is allocated. We've often found that particular job types need additional business logic or extra data fields to account for their specific requirements.
So the best way may be to write your own scheduler, in your language of choice, which takes your particular needs into account.
