Perforce has a scalability issue with a large number of files to submit, reconcile, or integrate. Anything I can do to speed it up? - perforce

I'm working with Unreal Engine, so when Epic updates the engine there can be a need to update 100k files or more. The Perforce server sits in an AWS instance.
The normal way to manage Unreal Engine updates, recommended even by Perforce, is to overwrite the local files and use 'reconcile' to find the changes.
The reconcile on my laptop, which was cutting-edge maybe three years ago, takes 12 hours and then fails silently.
If I manage to build the proper changelist by adding things little by little, and then hit submit, it takes 12 hours and fails silently.
Doing a submit in P4V will "jam" Perforce (the process stalls indefinitely, and once killed no other Perforce command will complete, even trivial ones). The only way to get Perforce commands working again is to reboot the client computer.
When using the command-line interface, commands like reconcile or submit give no useful output about what is going on and return error messages such as:
Some file(s) could not be transferred from client.
Submit aborted -- fix problems then use 'p4 submit -c 1996'
Is there anything I can do to speed these things up, or to figure out what Perforce is choking on?

To debug the failing submit, repeat it from the command line. The command line client will show you output for each file in the changelist, along with an error message for any where the transfer failed:
C:\Perforce\test\submit>p4 submit -d "Test submit"
Submitting change 312.
Locking 2 files ...
add //stream/main/submit/bar#1
add //stream/main/submit/foo#1
open for read: c:\Perforce\test\submit\foo: The system cannot find the file specified.
Submit aborted -- fix problems then use 'p4 submit -c 312'.
Some file(s) could not be transferred from client.
In the above example, note that the system error message (The system cannot find the file specified.) is printed on the line immediately before Perforce's Submit aborted error. P4V will sometimes obscure this error by displaying it in a different place from the top-level error message, or by not displaying it at all, but from the CLI you should always be able to find it by scrolling up in the terminal.
If the server is "jammed" when trying to open or submit 100k files, it suggests that it's under-resourced; Perforce installations scale to millions of files easily, but the hardware needs to be available to support that. Memory is the most typical bottleneck so the first thing I'd check/adjust would be the memory allocation for your AWS instance.
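If the hardware checks out and the bottleneck persists, a client-side workaround is to break the operation into smaller pieces so that no single command has to stage 100k files at once. A rough sketch (the directory layout, batch size, and paths are assumptions; it assumes p4 is on PATH and the workspace root splits into top-level directories):

```python
import subprocess
from pathlib import Path

def chunk(items, size):
    """Split a list into batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def reconcile_in_batches(root, batch_size=20):
    """Run one `p4 reconcile` per batch of top-level directories.

    Smaller batches keep each server transaction bounded, so a failure
    costs you one batch instead of the whole 100k-file operation, and
    the failing batch tells you roughly where the problem file lives.
    """
    dirs = sorted(str(p) for p in Path(root).iterdir() if p.is_dir())
    for batch in chunk(dirs, batch_size):
        subprocess.run(["p4", "reconcile", *(d + "/..." for d in batch)],
                       check=True)
```

You would call something like `reconcile_in_batches("C:/UnrealEngine/Engine")` after copying the new engine files over the workspace; the same batching idea applies to the submit.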


Perforce change-submit trigger to run script on client

I figured I'd post here, after posting on SuperUser, since I want to get input from software developers who might have encountered this scenario before!
I would like to initiate a series of validation steps on the client side on files opened within a changelist before allowing the changelist to be submitted.
For example, I wish to ensure that if a file is opened for add, edit, or remove as part of a changelist, that a particular related file will be treated appropriately based on a matrix of conditions for that corresponding file:
Corresponding file being opened for add/edit/remove
Corresponding file existing on disk vs. not existing on disk
Corresponding file existing in depot vs. not existing in depot
Corresponding file having been changed vs. not having been changed relative to depot file
These validation steps must be initiated before the submit is accepted by the Perforce server. Furthermore, the validation must be performed on the client side since I must be able to reconcile offline work with the copies on clients' disks.
Environment:
Perforce 2017.2 server
MacOS and Windows computers submitting to different branches
Investigative Avenues Already Covered
Initial design was a strictly client-side custom tool, but this is not ideal since this would be a change of the flow that users are familiar with, and I would also have to implement a custom GUI.
Among other approaches, I considered creating triggers in 2017.2; however, even if I were to use a change-content trigger with all the changelist files available on the server, I would not be able to properly perform the validation and remediation steps.
Another possibility would be to use a change-submit trigger together with the trigger script variables in 2017.2 to get the client's IP, hostname, current working directory, etc., so that a script on the server could try to connect remotely to the client's computer. However, running any script on the client's computer and in particular operating on their local workspace would require credentials that most likely will not be made available.
I would love to use a change-submit trigger on the Perforce server to initiate a script/bundled executable on the client's computer to perform p4 operations on their workspace to complete the validation steps. However, references that I've found (albeit from years ago) indicate that this is not possible:
https://stackoverflow.com/a/16061840
https://perforce-user.perforce.narkive.com/rkYjcQ69/p4-client-side-submit-triggers
Updating files with a Perforce trigger before submit
Thank you for reading and in advance for your help!
running any script on the client's computer and in particular operating on their local workspace would require credentials that most likely will not be made available.
This is the crux of it -- the Perforce server is not allowed to send the client arbitrary code to execute. If you want that type of functionality, you'd have to punch your own security hole in the client (and then come up with your own way of making sure it's not misused), and it sounds like you've already been down that road and decided it's not worth it.
Initial design was a strictly client-side custom tool, but this is not ideal since this would be a change of the flow that users are familiar with, and I would also have to implement a custom GUI.
My recommendation would be to start with that approach and then look for ways to decrease friction. For example, you could use a change-submit trigger to detect whether the user skipped the custom workflow (perhaps by having the custom tool put a token in the change description for the trigger to validate), and then give them an error message that puts them back on track, like "Please run Tools > Change Validator, or contact wanda#yourdomain.com for help".
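That token-checking trigger could be sketched like this. Everything specific here is hypothetical: the `[validated]` token, the script name, and the trigger entry; only the change-submit trigger type and the `%changelist%` variable come from Perforce's trigger mechanism:

```python
import re
import subprocess
import sys

# Hypothetical token that the client-side validation tool appends
# to the change description when its checks pass.
TOKEN = re.compile(r"\[validated\]")

def has_token(description):
    """Return True if the change description carries the validation token."""
    return bool(TOKEN.search(description))

def fetch_description(changelist):
    """Read the pending change spec from the server via `p4 change -o`."""
    spec = subprocess.run(["p4", "change", "-o", changelist],
                          capture_output=True, text=True, check=True).stdout
    # In plain spec form the Description: field is tab-indented lines.
    match = re.search(r"^Description:\n((?:\t.*\n?)+)", spec, re.MULTILINE)
    return match.group(1) if match else ""

def main(changelist):
    if not has_token(fetch_description(changelist)):
        print("Please run Tools > Change Validator before submitting.")
        sys.exit(1)  # non-zero exit from a change-submit trigger rejects the submit
    sys.exit(0)
```

The trigger table entry might look like `validate change-submit //depot/... "python validate_change.py %changelist%"`; a non-zero exit rejects the submit with the script's output shown to the user.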

Weird "Stale file handle, errno=116" on remote cluster after dozens of hours running

I'm running a simulation code called CMAQ on a remote cluster. I first ran a benchmark test in serial to gauge the software's performance. However, the job always runs for dozens of hours and then crashes with the following "Stale file handle, errno=116" error message:
PBS Job Id: 91487.master.cluster
Job Name: cmaq_cctm_benchmark_serial.sh
Exec host: hs012/0
An error has occurred processing your job, see below.
Post job file processing error; job 91487.master.cluster on host hs012/0
Unknown resource type  REJHOST=hs012.cluster MSG=invalid home directory '/home/shangxin' specified, errno=116 (Stale file handle)
This is very strange because I never modified my home directory, and "/home/shangxin/" is definitely my permanent directory, where the code lives.
Also, in the standard output .log file, the following message is always shown when the job fails:
Bus error
100247.930u 34.292s 27:59:02.42 99.5% 0+0k 16480+0io 2pf+0w
What does this message mean specifically?
At first I thought the job was exhausting the RAM and this was a memory overflow issue. However, when I logged into the compute node during a run to check memory usage with the "free -m" and "htop" commands, I noticed that neither RAM nor swap occupation ever exceeded 10%, a very low level, so memory usage is not the problem.
Because I used "tee" to record the job's output to a log file, that file can reach tens of thousands of lines and over 1 MB in size. To test whether this standard output was overwhelming the cluster, I ran the same job again without the log file. The new job still failed with the same "Stale file handle, errno=116" error after dozens of hours, so the standard output is not the reason either.
I also tried running the job in parallel on multiple cores; it still failed with the same error after dozens of hours.
I'm sure the code I'm using is fine because it finishes successfully on other clusters. The administrator of this cluster is looking into the issue but has not yet found the specific cause.
Has anyone ever run into this weird error? What should we do to fix this problem on the cluster? Any help is appreciated!
On academic clusters, home directories are frequently mounted via NFS on each node in the cluster to give you a uniform experience across all of the nodes. If this were not the case, each node would have its own version of your home directory, and you would have to take explicit action to copy relevant files between worker nodes and/or the login node.
It sounds like the NFS mount of your home directory on the worker node probably failed while your job was running. This isn't a problem you can fix directly unless you have administrative privileges on the cluster. If you need to make a work-around and cannot wait for sysadmins to address the problem, you could:
Try using a different network drive on the worker node (if one is available). On clusters I've worked on, there is often scratch space or other NFS mounts directly under /. You might get lucky and find one that is more reliable than your home directory.
Have your job work in a temporary directory local to the worker node, and write all of its output files and logs to that directory. At the end of your job, you would need to make it copy everything to your home directory on the login node on the cluster. This could be difficult with ssh if your keys are in your home directory, and could require you to copy keys to the temporary directory which is generally a bad idea unless you restrict access to your keys with file permissions.
Try getting assigned to a different node of the cluster. In my experience, academic clusters often have some nodes that are flakier than others. Depending on local settings, you may be able to request certain nodes directly, or request resources that are only available on stable nodes. If you can track which nodes are unstable and find your job assigned to one of them, you can resubmit the job and then cancel the copy running on the unstable node.
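The second option above (work in node-local scratch, copy back at the end) can be wrapped in a small helper; the command and destination here are placeholders to adapt to your job:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def run_in_scratch(command, results_dir):
    """Run a job in node-local scratch, then copy its output tree back.

    Writing to local disk avoids touching the flaky NFS home mount
    until the very end of the job, when everything is copied at once.
    """
    scratch = Path(tempfile.mkdtemp(prefix="cmaq_"))
    subprocess.run(command, cwd=scratch, check=True)
    dest = Path(results_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for item in scratch.iterdir():
        target = dest / item.name
        if item.is_dir():
            shutil.copytree(item, target, dirs_exist_ok=True)
        else:
            shutil.copy2(item, target)
    shutil.rmtree(scratch)
```

If your keys or the destination also live on the failing NFS mount, the final copy can still hit the same error, so this only narrows the window of exposure rather than eliminating it.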
The easiest solution is to work with the cluster administrators, but I understand they don't always work on your schedule.

What is P4LOG? Why does it fill up?

At my company we use Perforce, and recently we started receiving the error "The filesystem 'P4LOG' has only XXX free, but the server configuration requires at least YYY available". What does it mean? All I found by googling is that this is an environment variable: the name and path of the file to which Perforce errors are written. We asked our administrator to fix the problem, and he increased the minimum-size setting, which removes the error for a while, but later it appears again. Why does this warning appear, and how can I prevent it?
P4LOG names the Perforce server's log file, and the warning means the filesystem that log lives on is running low on free space; the log grows over time as the server records errors and activity. Your administrator can either set up Perforce's own log rotation or use the standard Linux logrotate tool. This is essential if you don't want to keep running out of disk space on the Perforce server.
As the p4 logrotate docs mention, rotation also happens automatically at each checkpoint or journal event. You should have checkpointing and journaling set up anyway, so that you don't risk losing your entire depot if the p4 server has a problem.
In short: your administrator needs to set up log rotation, which will compress recent logs and remove old ones.
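If he goes the logrotate route, a minimal config might look like this; the log path is an assumption and must be matched to your server's actual P4LOG setting:

```
# /etc/logrotate.d/perforce -- example only; point this at the file
# named by your server's P4LOG setting.
/opt/perforce/logs/log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
    copytruncate
}
```

`copytruncate` rotates the file in place rather than renaming it, which avoids having to restart or signal p4d during rotation.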

How can I tell (in a bash script) if a Clonezilla batch-mode backup succeeded?

This is my first ever post to stackoverflow, so be gentle please. ;>
Ok, I'm using slightly customized Clonezilla-Live cd's to backup the drives on four PCs. Each cd is for a specific PC, saving an image of its disk(s) to a box-specific backup folder on a samba server. That's all pretty much working. But once in a while, Something Goes Wrong, and the backup isn't completed properly. Things like: the cat bit through a cat5e cable; I forgot to check if the samba server had run out of room; etc. And it is not always readily apparent that a failure happened.
I will admit right now that I'm pretty much a noob as far as Linux system administration goes, even though I somehow managed to set up a CentOS 6 box (I wish I'd picked Ubuntu...) with samba, git, ssh, and bitnami-gitlab back in February.
I've spent days and days and days trying to figure out if clonezilla leaves a simple clue in a backup as to whether it succeeded completely or not, and have come up dry. Looking in the folder for a particular backup job (on the samba server) I see that the last file written is named "clonezilla-img". It seems to be a console dump that covers the backup itself. But it does not seem to include the verification pass.
Regardless of whether the batch backup task succeeded or failed, I can automatically run a post-process bash script that I place on my Clonezilla CDs. I have this running just fine, though it's not doing a whole lot right now. What I would like this post-process script to do is determine whether the backup job succeeded, and then rename (mv) the backup job directory to include a word like "SUCCESS" or "FAILURE". I know how to do the renaming part; it's the test for success or failure that I'm at a loss about.
Thanks for any help!
I know this is old, but I've just started on looking into doing something very similar.
For your case I think you could do what you're looking for with ocs_prerun and ocs_postrun scripts.
For my setup I'm using a pen/flash drive for some test systems and also PXE with an NFS mount. PXE and NFS are much easier to test and modify quickly.
I haven't tested this yet, but I was thinking I might be able to search the logs in /var/log/{clonezilla.log,partclone.log} via an ocs_postrun script to validate success or failure. I haven't seen anything indicating that the result is set in the environment, so the logs look like the quick and easy method compared with mounting the image or running a CRC check. Clonezilla does have an option to validate the image, and the results of that might be in the local logs.
Another option might be to create a custom ocs_live_run script to do something similar. There is an example at this URL http://clonezilla.org/fine-print-live-doc.php?path=./clonezilla-live/doc/07_Customized_script_with_PXE/00_customized_script_with_PXE.doc#00_customized_script_with_PXE.doc
Maybe the exit code of ocs-sr can be checked in the script? As I said, I haven't tried any of this; these are just some thoughts.
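A sketch of that idea for the post-run script: scan the logs for failure markers and rename the backup directory accordingly. The failure pattern here is a guess to verify against your actual log contents, and the rename helper is the SUCCESS/FAILURE tagging the question asked about:

```python
import os
import re

def backup_succeeded(log_text):
    """Heuristic: scan a Clonezilla/partclone log for failure markers.

    The exact strings vary between versions, so treat this pattern as
    a starting point and adjust it after inspecting your own logs.
    """
    return re.search(r"(?i)fail|error", log_text) is None

def tag_backup_dir(backup_dir, ok):
    """Rename the backup job directory to flag the outcome."""
    suffix = "SUCCESS" if ok else "FAILURE"
    tagged = f"{backup_dir.rstrip('/')}-{suffix}"
    os.rename(backup_dir, tagged)
    return tagged
```

An ocs_postrun script could read /var/log/partclone.log, call these two functions, and leave the verdict visible in the directory name on the samba share.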
I updated the above to reflect the log location (/var/log). The logs are in the log folder of course. :p
Regards

Process text files FTP'ed into a set of directories on a hosted server

The situation is as follows:
A series of remote workstations collect field data and send it to a server via FTP. The data is sent as a CSV file, which is stored in a unique directory for each workstation on the FTP server.
Each workstation sends a new update every 10 minutes, overwriting the previous data. We would like to concatenate or archive this data automatically. The workstations' processing power is limited and cannot be extended, as they are embedded systems.
One suggestion offered was to run a cron job on the FTP server; however, a terms-of-service restriction only allows cron jobs at 30-minute intervals, as it's shared hosting. Given the number of workstations uploading and the 10-minute interval between uploads, the cron job's 30-minute limit looks like a problem.
Is there any other approach that might be suggested? The available server-side scripting languages are perl, php and python.
Upgrading to a dedicated server might be necessary, but I'd still like to get input on how to solve this problem in the most elegant manner.
Most modern Linux systems support inotify, which lets your process know when the contents of a directory have changed, so you don't even need to poll.
Edit: With regard to the comment below from Mark Baker :
"Be careful though, as you'll be notified as soon as the file is created, not when it's closed. So you'll need some way to make sure you don't pick up partial files."
That will happen with the inotify watch you set at the directory level. The way to make sure you don't then pick up a partial file is to set a further inotify watch on the new file itself and look for the IN_CLOSE_WRITE event, which tells you the file has been completely written.
Once your process has seen this, you can delete the inotify watch on this new file, and process it at your leisure.
You might consider a persistent daemon that keeps polling the target directories:
use Fcntl qw(:flock);

# Exit quietly if another instance of the daemon already holds the lock.
open(my $lock, '>', '/tmp/poller.lock') or exit;
flock($lock, LOCK_EX | LOCK_NB) or exit;

while (1) {
    if (my @files = new_files()) {    # e.g. glob('/path/to/uploads/*.csv')
        process_new_files(@files);
    }
    sleep(60);
}
Then your cron job can just try to start the daemon every 30 minutes. If the daemon can't grab the lockfile, it just dies, so there's no worry about multiple daemons running.
Another approach to consider would be to submit the files via HTTP POST and then process them via a CGI. This way, you guarantee that they've been dealt with properly at the time of submission.
The 30-minute limitation is pretty silly, really. Starting processes on Linux is not an expensive operation, so if all you're doing is checking for new files there's no good reason not to do it more often than that. We have cron jobs that run every minute and they have no noticeable effect on performance. However, I realise it's not your rule, and if you're going to stick with that hosting provider you don't have a choice.
You'll need a long-running daemon of some kind. The easy way is to poll regularly, and that's probably what I'd do. Inotify, which notifies you as soon as a file is created, is the better option.
You can use inotify from perl with Linux::Inotify, or from python with pyinotify.
Be careful though, as you'll be notified as soon as the file is created, not when it's closed. So you'll need some way to make sure you don't pick up partial files.
With polling it's less likely you'll see partial files, but it will happen eventually and will be a nasty hard-to-reproduce bug when it does happen, so better to deal with the problem now.
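For the polling route, one stdlib-only guard is to treat a file as complete only once its size has stopped changing between two checks. The interval is arbitrary, and note the stated caveat in the comment: a stalled upload also looks "stable":

```python
import os
import time

def is_complete(path, settle_seconds=2):
    """Consider a file finished when its size is stable across one interval.

    This is a heuristic: a stalled upload also looks 'stable', so pair it
    with a sanity check (e.g. the CSV parses cleanly) where possible.
    """
    before = os.path.getsize(path)
    time.sleep(settle_seconds)
    return os.path.getsize(path) == before
```

The polling daemon would call this on each candidate file and skip, for now, any file still growing.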
If you're looking to stay with your existing FTP server setup, then I'd advise using something like inotify or a daemonized process to watch the upload directories. If you're OK with moving to a different FTP server, you might take a look at pyftpdlib, a Python FTP server library.
I've been part of the pyftpdlib dev team for a while, and one of the more common requests was for a way to "process" files once they've finished uploading. Because of that we created an on_file_received() callback method that's triggered on completion of an upload (see issue #79 on our issue tracker for details).
If you're comfortable in Python then it might work out well for you to run pyftpdlib as your FTP server and run your processing code from the callback method. Note that pyftpdlib is asynchronous and not multi-threaded, so your callback method can't be blocking. If you need to run long-running tasks I would recommend a separate Python process or thread be used for the actual processing work.
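A minimal sketch of that setup, assuming pyftpdlib is installed. The account details, paths, and port are placeholders, and the queue hand-off to a worker thread is one way of keeping the callback from blocking the async loop:

```python
import queue
import threading

# Completed uploads are enqueued by the callback and consumed by a worker.
work_queue = queue.Queue()

def process_csv(path):
    """Placeholder for the real concatenate/archive logic."""
    pass

def worker():
    """Consume completed uploads off the queue and process them."""
    while True:
        path = work_queue.get()
        process_csv(path)
        work_queue.task_done()

def run_server(root="/srv/ftp", port=2121):
    # Imported here so the sketch reads without pyftpdlib installed.
    from pyftpdlib.authorizers import DummyAuthorizer
    from pyftpdlib.handlers import FTPHandler
    from pyftpdlib.servers import FTPServer

    class Handler(FTPHandler):
        def on_file_received(self, file):
            # Fires when an upload completes; just enqueue, never block here.
            work_queue.put(file)

    authorizer = DummyAuthorizer()
    authorizer.add_user("station", "secret", root, perm="elw")
    Handler.authorizer = authorizer
    threading.Thread(target=worker, daemon=True).start()
    FTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Because pyftpdlib's loop is single-threaded, anything slow (parsing, appending to archives) belongs in the worker thread or a separate process, not in on_file_received itself.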
