PuppetDB: purge stockpile command queue

I recently tried to upgrade one of our Puppet compile masters from 4.8.2 to 5.5.10; however, our PuppetDB remained on version 4.4. This caused a schema validation error similar to the one reported in PDB-3743. I have since reverted the change, but I am now left with a command queue of 2k+.
Inspecting the stockpile directory /var/lib/puppetdb/stockpile/cmd/q, I can see that all of the files in the queue are reports from hosts which used the upgraded Puppet master, and they all have a job_id: null value.
Could anyone enlighten me on how to purge this queue? Moving files out of this directory does not make the queue go down. Further, when does the queue runner try to reprocess the files in its queue, and can one manually force this? I only ever see a stack trace for the first time a report is submitted, suggesting that the queue runner never attempts to re-process these reports.

Answering my own question: I was able to clear down the queue by shutting down PuppetDB and removing all the files from /var/lib/puppetdb/stockpile/cmd/q. I further noticed that PuppetDB will try to re-process any files in /var/lib/puppetdb/stockpile/cmd/q when it starts up.
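For anyone in the same situation, here is a minimal Python sketch of that clean-up, assuming the puppetdb service has already been stopped; the queue path is the one from the post, and the backup directory is an arbitrary location of my own choosing.

# Rough sketch of the clean-up described above: run this only while the
# puppetdb service is stopped. The backup directory name is arbitrary.
import os
import shutil

QUEUE_DIR = "/var/lib/puppetdb/stockpile/cmd/q"
BACKUP_DIR = "/var/lib/puppetdb/stockpile-backup"   # hypothetical location

os.makedirs(BACKUP_DIR, exist_ok=True)

# Move (rather than delete) every queued command file so the reports can
# still be inspected or replayed later if needed.
for name in os.listdir(QUEUE_DIR):
    shutil.move(os.path.join(QUEUE_DIR, name), os.path.join(BACKUP_DIR, name))

Moving the files aside rather than deleting them keeps the bad reports around in case they turn out to be useful for a bug report.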

Related

Perforce has a scalability issue with a large number of files to submit, reconcile, or integrate. Anything I can do to speed it up?

I'm working with Unreal Engine, so when Epic updates the engine there can be a need to update 100k files or more. The Perforce server sits on an AWS instance.
The normal way to manage Unreal Engine updates, even recommended by Perforce, is to write over the local files and use 'reconcile' to find the changes.
The reconcile on my laptop, which was cutting edge maybe three years ago, takes 12 hours and then fails silently.
If I manage to build the proper changelist by adding things little by little and then hit submit, it takes 12 hours and fails silently.
Doing a submit in P4V will "jam" Perforce (the process will stall indefinitely, and once killed no other Perforce command will complete, even trivial ones). The only way to get Perforce commands working again is to reboot the client computer.
When using the command line interface, commands like reconcile or submit have no useful output as to what is going on and return error messages such as:
Some file(s) could not be transferred from client.
Submit aborted -- fix problems then use 'p4 submit -c 1996'
Anything I can do to speed these things up, or to figure out what Perforce is choking on?
To debug the failing submit, repeat it from the command line. The command line client will show you output for each file in the changelist, along with an error message for any file whose transfer failed:
C:\Perforce\test\submit>p4 submit -d "Test submit"
Submitting change 312.
Locking 2 files ...
add //stream/main/submit/bar#1
add //stream/main/submit/foo#1
open for read: c:\Perforce\test\submit\foo: The system cannot find the file specified.
Submit aborted -- fix problems then use 'p4 submit -c 312'.
Some file(s) could not be transferred from client.
In the above example, note that the system error message (The system cannot find the file specified.) is printed on the line immediately before Perforce's Submit aborted error. P4V will sometimes obscure this error by displaying it in a different place from the top-level error message, or not displaying it at all, but from the CLI you should always be able to find it by scrolling up in the terminal.
If the server is "jammed" when trying to open or submit 100k files, it suggests that it's under-resourced; Perforce installations scale to millions of files easily, but the hardware needs to be available to support that. Memory is the most typical bottleneck so the first thing I'd check/adjust would be the memory allocation for your AWS instance.

Databricks Connect: DependencyCheckWarning: The java class may not be present on the remote cluster

I was performing yet another execution of local Scala code against the remote Spark cluster on Databricks and got this.
Exception in thread "main" com.databricks.service.DependencyCheckWarning: The java class <something> may not be present on the remote cluster. It can be found in <something>/target/scala-2.11/classes. To resolve this, package the classes in <something>/target/scala-2.11/classes into a jar file and then call sc.addJar() on the package jar. You can disable this check by setting the SQL conf spark.databricks.service.client.checkDeps=false.
I have tried reimporting, cleaning and recompiling the sbt project to no avail.
Anyone know how to deal with this?
Apparently the documentation has that covered:
spark.sparkContext.addJar("./target/scala-2.11/hello-world_2.11-1.0.jar")
I guess it makes sense that everything you write as code external to Spark is considered a dependency. So a simple sbt publishLocal and then pointing to the jar path in the above command will sort you out.
My main confusion came from the fact that I didn't need to do this for a very long while, until at some point this mechanism kicked in. Rather inconsistent behavior, I'd say.
A personal observation after working with this setup is that it seems you only need to publish a jar once. I have been changing my code multiple times and the changes are reflected, even though I have not been continuously publishing jars for each new change. That makes the whole task a one-off. Still confusing, though.

Weird "Stale file handle, errno=116" on remote cluster after dozens of hours running

I'm now running a simulation code called CMAQ on a remote cluster. I first ran a benchmark test in serial to see the performance of the software. However, the job always runs for dozens of hours and then crashes with the following "Stale file handle, errno=116" error message:
PBS Job Id: 91487.master.cluster
Job Name: cmaq_cctm_benchmark_serial.sh
Exec host: hs012/0
An error has occurred processing your job, see below.
Post job file processing error; job 91487.master.cluster on host hs012/0Unknown resource type REJHOST=hs012.cluster MSG=invalid home directory '/home/shangxin' specified, errno=116 (Stale file handle)
This is very strange because I never modify the home directory, and "/home/shangxin" is surely my permanent directory, where the code lives.
Also, in the standard output .log file, the following message is always shown when the job fails:
Bus error
100247.930u 34.292s 27:59:02.42 99.5% 0+0k 16480+0io 2pf+0w
What does this message mean specifically?
I once thought this error was due to the job exhausting the RAM, i.e. a memory overflow issue. However, when I logged into the compute node while the job was running and checked the memory usage with the "free -m" and "htop" commands, I saw that both RAM and swap usage never exceeded 10%, so memory is not the problem.
Because I used "tee" to record the job's output to a log file, that file can grow to tens of thousands of lines and over 1MB in size. To test whether this standard output was overwhelming the cluster, I ran the same job again without the standard output log file. The new job still failed with the same "Stale file handle, errno=116" error after dozens of hours, so the standard output is not the reason either.
I also tried running the job in parallel on multiple cores; it still failed with the same error after dozens of hours of running.
I'm confident the code itself is fine because it finishes successfully on other clusters. The administrator of this cluster is looking into the issue but has not found the specific cause so far.
Has anyone ever run into this weird error? What should we do to fix this problem on the cluster? Any help is appreciated!
On academic clusters, home directories are frequently mounted via NFS on each node in the cluster to give you a uniform experience across all of the nodes. If this were not the case, each node would have its own version of your home directory, and you would have to take explicit action to copy relevant files between worker nodes and/or the login node.
It sounds like the NFS mount of your home directory on the worker node probably failed while your job was running. This isn't a problem you can fix directly unless you have administrative privileges on the cluster. If you need to make a work-around and cannot wait for sysadmins to address the problem, you could:
Try using a different network drive on the worker node (if one is available). On clusters I've worked on, there is often scratch space or other NFS mounts directly under the root /. You might get lucky and find an NFS mount that is more reliable than your home directory.
Have your job work in a temporary directory local to the worker node, and write all of its output files and logs to that directory. At the end of your job, have it copy everything back to your home directory on the login node of the cluster. This could be difficult with ssh if your keys are in your home directory, and could require you to copy keys to the temporary directory, which is generally a bad idea unless you restrict access to them with file permissions. (A rough sketch of this copy-back step is shown after this list.)
Try getting assigned to a different node of the cluster. In my experience, academic clusters often have some nodes that are more flakey than others. Depending on local settings, you may be able to request certain nodes directly, or potentially request resources that are only available on stable nodes. If you can track which nodes are unstable, and you find your job assigned to an unstable node, you could resubmit your job, then cancel the job that is on an unstable node.
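To make the second option more concrete, here is a minimal Python sketch of the copy-back step; both paths are placeholders that will differ on your cluster, and shutil.copytree with dirs_exist_ok needs Python 3.8+.

# Minimal sketch of option 2: run the job in node-local scratch space and
# copy the results back to the (NFS-mounted) home directory at the end.
# Both paths below are placeholders; use whatever your cluster provides.
import os
import shutil
import tempfile

home_results = os.path.expanduser("~/cmaq_results")   # NFS-mounted home (placeholder)
work_dir = tempfile.mkdtemp(dir="/tmp")               # node-local scratch (placeholder)

# ... run the simulation with its output directed into work_dir ...

# Copy everything back to the home directory once the run has finished,
# then clean up the node-local scratch space.
shutil.copytree(work_dir, home_results, dirs_exist_ok=True)
shutil.rmtree(work_dir)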
The easiest solution is to work with the cluster administrators, but I understand they don't always work on your schedule.

How to know which process deleted my file on a Linux server?

I have a workflow pipeline which periodically generates data files on a Linux server, and also a cleanup service which removes data files that are older than a week.
However, sometimes I find that a newly generated data file is missing, even though it is definitely less than a week old. I'm not sure whether this is a logic bug in the cleanup service or whether another program did it. Currently I don't have any idea how to investigate the issue. Is there any method to log all file deletion activity along with the process ID and process name responsible?
Thanks in advance.

Process text files FTP'ed into a set of directories on a hosted server

The situation is as follows:
A series of remote workstations collect field data and send it to a server via FTP. The data is sent as a CSV file which is stored in a unique directory for each workstation on the FTP server.
Each workstation sends a new update every 10 minutes, causing the previous data to be overwritten. We would like to somehow concatenate or store this data automatically. The workstations' processing power is limited and cannot be extended, as they are embedded systems.
One suggestion offered was to run a cron job on the FTP server; however, the terms of service only allow cron jobs at 30-minute intervals since it's shared hosting. Given the number of workstations uploading and the 10-minute interval between uploads, the 30-minute limit between cron runs looks like it might be a problem.
Is there any other approach that might be suggested? The available server-side scripting languages are Perl, PHP and Python.
Upgrading to a dedicated server might be necessary, but I'd still like to get input on how to solve this problem in the most elegant manner.
Most modern Linux distributions support inotify, which lets your process know when the contents of a directory have changed, so you don't even need to poll.
Edit: with regard to the comment below from Mark Baker:
"Be careful though, as you'll be notified as soon as the file is created, not when it's closed. So you'll need some way to make sure you don't pick up partial files."
That will happen with the inotify watch you set at the directory level. The way to make sure you don't then pick up a partial file is to set a further inotify watch on the new file and look for the IN_CLOSE event, so that you know the file has been written to completely.
Once your process has seen this, you can delete the inotify watch on this new file, and process it at your leisure.
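If you go the Python route mentioned in the other answers, a minimal pyinotify sketch of this idea could look like the following; the upload path and the processing function are placeholders, and watching the directory itself for IN_CLOSE_WRITE gives you the same "file is complete" guarantee without needing a per-file watch.

# Minimal pyinotify sketch: watch an upload directory and only process a
# file once it has been closed after writing (i.e. the upload is complete).
# The directory path and process_file() are placeholders for your own code.
import pyinotify

UPLOAD_DIR = "/path/to/ftp/uploads"

def process_file(path):
    print("finished upload:", path)   # replace with real processing

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        process_file(event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch(UPLOAD_DIR, pyinotify.IN_CLOSE_WRITE, rec=True)
pyinotify.Notifier(wm, Handler()).loop()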
You might consider a persistent daemon that keeps polling the target directories:
grab_lockfile() or exit();
while (1) {
    if (new_files()) {
        process_new_files();
    }
    sleep(60);
}
Then your cron job can just try to start the daemon every 30 minutes. If the daemon can't grab the lockfile, it just dies, so there's no worry about multiple daemons running.
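For concreteness, here is a rough Python equivalent of that pseudocode; the watch directory and the processing step are placeholders, and the lockfile is implemented with a non-blocking flock() so a second copy started by cron simply exits.

# Rough Python equivalent of the pseudocode above. The watch directory,
# lock file path and the processing step are placeholders for your own logic.
import fcntl
import glob
import os
import time

WATCH_DIR = "/path/to/ftp/uploads"   # per-workstation directories live under here
seen = {}                            # path -> last mtime we processed

lock = open("/tmp/ftp-poller.lock", "w")
try:
    # Non-blocking exclusive lock: if another copy already holds it, just die.
    fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    raise SystemExit(0)

while True:
    for path in glob.glob(os.path.join(WATCH_DIR, "*", "*.csv")):
        mtime = os.path.getmtime(path)
        if seen.get(path) != mtime:
            seen[path] = mtime
            print("new or updated file:", path)   # placeholder for real processing
    time.sleep(60)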
Another approach to consider would be to submit the files via HTTP POST and then process them via a CGI. This way, you guarantee that they've been dealt with properly at the time of submission.
The 30-minute limitation is pretty silly really. Starting processes in Linux is not an expensive operation, so if all you're doing is checking for new files there's no good reason not to do it more often than that. We have cron jobs that run every minute and they don't have any noticeable effect on performance. However, I realise it's not your rule, and if you're going to stick with that hosting provider you don't have a choice.
You'll need a long-running daemon of some kind. The easy way is to just poll regularly, and that's probably what I'd do. Inotify, so you get notified as soon as a file is created, is a better option.
You can use inotify from Perl with Linux::Inotify, or from Python with pyinotify.
Be careful though, as you'll be notified as soon as the file is created, not when it's closed. So you'll need some way to make sure you don't pick up partial files.
With polling it's less likely you'll see partial files, but it will happen eventually and will be a nasty hard-to-reproduce bug when it does happen, so better to deal with the problem now.
If you're looking to stay with your existing FTP server setup then I'd advise using something like inotify or a daemonized process to watch the upload directories. If you're OK with moving to a different FTP server, you might take a look at pyftpdlib, which is a Python FTP server library.
I've been a part of the dev team for pyftpdlib for a while, and one of the more common requests was for a way to "process" files once they've finished uploading. Because of that we created an on_file_received() callback method that's triggered on completion of an upload (see issue #79 on our issue tracker for details).
If you're comfortable in Python then it might work out well for you to run pyftpdlib as your FTP server and run your processing code from the callback method. Note that pyftpdlib is asynchronous and not multi-threaded, so your callback method can't be blocking. If you need to run long-running tasks I would recommend a separate Python process or thread be used for the actual processing work.
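For reference, a minimal sketch of that setup might look something like this; the credentials, port, home directory, and process_csv() function are all placeholders, and any heavy processing should be handed off to a separate process or thread as noted above.

# Minimal pyftpdlib sketch: process each CSV as soon as its upload finishes.
# Credentials, port, the home directory, and process_csv() are placeholders.
from pyftpdlib.authorizers import DummyAuthorizer
from pyftpdlib.handlers import FTPHandler
from pyftpdlib.servers import FTPServer

def process_csv(path):
    print("received:", path)   # replace with concatenation/storage logic

class CSVHandler(FTPHandler):
    def on_file_received(self, file):
        # Called only after the upload has completed, so no partial files.
        process_csv(file)

authorizer = DummyAuthorizer()
authorizer.add_user("station", "secret", "/srv/ftp", perm="elradfmw")

CSVHandler.authorizer = authorizer
FTPServer(("0.0.0.0", 2121), CSVHandler).serve_forever()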
