Disclaimer: I'll leave the question open, but my slow branching issue turned out to be due to the server being overloaded, so this is not the usual Perforce behavior. It now takes me about 30 seconds to branch 10K files.
I am a new Perforce 2014 user. I have created a stream depot, and put the legacy application (about 10,000 cpp files) in a development branch. That was relatively quick for an initial import, about 1 hour to upload everything.
Now I want to create a 'lightweight' task stream to work on a new feature.
I use the default menu > New Stream > type: task ... and select "Branch files from parent on stream creation".
To my surprise, it takes ages (about 1 hour) to create the new task stream because it branches every file individually. Coming from other SCM tools (Git, SVN, ...), I would expect the process to be almost instant.
Now my questions are:
Is this the expected behavior?
Alternatively, is there a way to create a task stream more quickly, and only branch the files I intend to modify?
Creating a new task stream of 10k files SHOULD be a very quick operation (on the order of a second or two?) since no actual file content is being transferred, assuming you're starting with a synced workspace. If you're creating a brand new workspace for the new stream, then part of creating the new workspace is going to be syncing the files; I'd expect that to take about as long as the submit did since it's the same amount of data being transferred.
Make sure when creating a new stream that you're not creating a new workspace. In the visual client there's an option to "create a workspace"; make sure to uncheck that box or it'll make a new workspace and then sync it, which is the part that'll take an hour.
From the command line, starting from a workspace of //stream/parent, here's what you'd do to make a new task stream:
p4 stream -t task -P //stream/parent //stream/mynewtask01
p4 populate -r -S //stream/mynewtask01
p4 client -s -S //stream/mynewtask01
p4 sync
The "stream" and "client" commands don't actually operate on any files, so they'll be really quick no matter what. The "populate" will branch all 10k files, but it does it on the back end without actually moving any content around, so it'll also be really quick (if you got up into the millions or billions it might take an appreciable amount of time depending on the server hardware, but 10k is nothing). The "sync" will be very quick if you were already synced to //stream/parent, because all the files are already there; again, it's just moving pointers around on the server side rather than transferring the file content.
I have found a workaround which speeds things up a bit, and is actually closer to what I intend to do:
Task streams are created by default with the following "stream view"
share ...
I have replaced it with
import ...
share /directory/I/actually/want/to/modify/...
So I skip branching most of the files, and it is working fine.
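For reference, the relevant part of the task stream spec ends up looking roughly like this (edit it with 'p4 stream'; the stream names are just the example ones from the answer above, and the directory is my example path):
Stream: //stream/mynewtask01
Parent: //stream/parent
Type:   task
Paths:
    import ...
    share /directory/I/actually/want/to/modify/...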
I am writing some gross middleware - basically, I have some old code that needs to open 100,000 files for reading only, expecting them all to be in one folder. It never writes. It is multiprocess, so it can try to open ~30 files at the same time. The old way, I would have to actually copy the files into that folder (or use links, NFS, etc.). Worth noting: I have no ability to change this old code - it's just a binary.
I have some new, fancy code that can retrieve a file almost instantly. I want to tie these things together, so when the old code tries to open the file, it is actually, in real time, running the new code.
So I thought of mkfifo and inotifywait. Instead of a folder of 100,000 files, I can make a folder of 100,000 named pipes. So far so good. The legacy code goes to open the files, not knowing that they are indeed named pipes. The problem is, I don't know what order the legacy code is going to open the files (nice, right?). So I would like to TRIGGER the named pipe WRITE (from my fancy new code) when the legacy code goes in for the read. I can't spawn 100,000 writes and have them all block. So I thought, hey - inotifywait makes sense. Every time the legacy code goes to open the pipe, it triggers a read event, which can then be used to spawn the pipe writer in the background. The problem is... inotifywait doesn't trigger the read event until AFTER the writer has been spawned!
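Roughly, the setup I have in mind looks like this (paths and the retrieval command are placeholders):
mkdir /tmp/pipes
for i in $(seq 1 100000); do mkfifo "/tmp/pipes/file_$i"; done
# spawn a writer whenever the legacy binary opens one of the pipes for reading
inotifywait -m -e open /tmp/pipes | while read -r dir event name; do
    fancy_retrieve "$name" > "$dir$name" &   # hypothetical retrieval command
done
# but as described above, the open event on a FIFO doesn't show up until a writer
# has already attached, so this loop never gets the chance to spawn that writer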
Any ideas of how to solve this? Basically - I want to intercept a file open, block for a couple hundred ms while I retrieve the contents of the file, then return those contents. Ideally I don't have to create a custom FUSE filesystem to do this... it's just a read-only file open. The problem is this needs to run fast and in parallel... and I don't know which files are going to be opened in what order. There's gotta be a quick and dirty way!
Thanks in advance for everyone's time.
I'm currently monitoring files in node.js using fs.watch. The problem I have is, for example, let's say I copy a 1 GB file into a folder I'm watching. The moment the file starts copying I get a notification about the file. If I start reading it immediately I end up with bad data, since the file has not finished copying. For example, a zip file has its table of contents at the end, but I'd end up reading it before the table of contents has been written.
Off the top of my head, I could set up some task to call fs.stat on the file every N seconds and only try to read it when the stats stop changing. That would work, but it seems far from ideal: I'd like my app to be as responsive as possible, and calling stat on a bunch of files every second seems heavy, while calling stat only every 5 or 10 seconds seems unresponsive.
Is there some more robust way to get notified when a file has finished being modified?
So I did a project last year which required doing "file watching". There is a better library out there than fs.watch. Check out npm chokidar.
https://www.npmjs.com/package/chokidar
Under the hood it uses fs.watch, but adds a number of improvements on top of it.
There is an option called awaitWriteFinish. It effectively polls the file to determine whether or not it has finished being written. I used it and it really works great.
Setting this property will allow you to work against that file, always ensuring that the file has been completely written. And you don't need to go off and implement your own method of determining if the file is complete. Should save a bunch of time.
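For example, a minimal sketch (the watched path and timing values are just placeholders):
const chokidar = require('chokidar');

const watcher = chokidar.watch('/folder/to/watch', {
  awaitWriteFinish: {
    stabilityThreshold: 2000, // file size must be unchanged for 2s before the event fires
    pollInterval: 100         // how often to re-check the size while it is still growing
  }
});

// 'add' (and 'change') now fire only once the file has finished being written
watcher.on('add', filePath => console.log('finished writing:', filePath));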
Aside from that, I don't believe you can really get away from polling with regard to determining if a file is finished writing. Chokidar is still polling, it's just that you don't need to write the logic to do it. And you can configure the polling interval if CPU utilization is deemed to be too high.
Edit: I would also add: just give it a shot and see how it works. I get that you want it as responsive as possible... but having something working is better than having nothing working at all. It may turn out that even with a polling solution it's not an issue for you. If it does become a performance problem, address it at that time and look for a "better" solution.
We'd like to run Kiba as a batch process on a series of files. What would be the best structure to give a file mask, download the files from FTP, and then run the ETL job on each, sending a success or failure notification on a per file basis?
Is there a way to do this from within Kiba, or is the best practice just to handle all the non-ETL stuff externally, and then just call kiba on each file?
I would initially start with the simplest possible thing, which is, like you said, handling the non-ETL work externally and then calling Kiba on each file. E.g.:
Build a rake task to download the files locally (and remove them from the FTP, or at least move them to a separate folder to avoid double-processing), inside a well-known folder which will act as an inbox. See here for interesting links on how to do that.
Then build another rake task to iterate over the inbox folder and process a given file (using Dir[pattern].each).
Make sure to use a helper such as:
def system!(command)
  fail "Command #{command} failed" unless system(command)
end
to make sure you detect failures in execution when making system calls.
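Here is a rough sketch of what the inbox-processing rake task could look like (the folder name and ETL script name are just examples, and it reuses the system! helper above):
# Rakefile
desc "Run the ETL on every file waiting in the inbox"
task :process_inbox do
  Dir["inbox/*"].each do |file|
    # one kiba run per file, so a failure on one file doesn't stop the others
    system!("FILE=#{file} bundle exec kiba etl/process_file.etl")
  end
end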
For your ETL file itself, you would use an at_exit block to capture failures and notify accordingly (see the example linked here with Bugsnag), and a post_process block to capture success and notify in that case.
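A sketch of that ETL file (the source class and the notify_* helpers are placeholders for whatever notification mechanism you use):
# etl/process_file.etl
file = ENV.fetch('FILE')

at_exit do
  # $! holds the exception (if any) that is causing the process to exit
  notify_failure(file, $!) if $!
end

source MyCsvSource, filename: file
# ... transforms and destination here ...

post_process do
  # only reached if the whole run succeeded
  notify_success(file)
end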
This will definitely work and is simple. That said, there are other possibilities, such as a single ETL file which downloads the files in a pre_process block, then uses a source which yields one filename per downloaded file, and maybe a transform which itself calls kiba on the command line, or even more advanced setups.
I would stick to the simplest possible solution to get started, as always!
Stackoverflow has saved my life on countless occasions over the years. Now, it's time for me to post my first question ever, the answer to which I have been unable to find so far.
I have a tool (language/implementation is irrelevant) which accepts a text file as input. This text file (let's call it file_list.txt) contains a long list of file paths, one per line. The tool then iterates over the lines in file_list.txt and does something with every file path. This needs to be done continuously and file_list.txt needs to always contain the latest file paths because users continuously upload or delete files from the share being monitored. To achieve this, I have set up a cron job which calls a script. First the script calls the find utility with the search parameters required and pipes the output to a temporary file. When the file is fully populated, it is moved to file_list.txt. Then, once this is done, the tool is invoked with file_list.txt as an input parameter.
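For reference, the script called by cron looks roughly like this (paths, find parameters, and the tool name are anonymized):
#!/bin/sh
# regenerate the list into a temp file, then swap it in only once it is fully populated
find /path/to/share -type f > /tmp/file_list.tmp
mv /tmp/file_list.tmp /path/to/file_list.txt
/path/to/tool /path/to/file_list.txt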
So far, so good. The share being monitored is VERY LARGE (~60 TB) and the find command takes around 5 hours to execute. This is not a problem since we have multiple overlapping find commands running in parallel (triggered once per hour). The entire setup runs on a compute farm, so CPU utilization, etc. is also not an issue.
The problem arises in the lag time for file detection. Ideally, I want a user to add a file and I want one of the already running, overlapping find commands to detect this file within a matter of minutes. However, I have noticed that none of the already-running find commands will detect this file. Only a find command started AFTER this file was added will detect it. This means that generally, I need to wait around 5 hours for a newly added file to be detected. This leads me to believe that the find utility somehow acts on a "cached" version of the share state when it was triggered. Is this true? Can anyone confirm this? And if so, what can I do to improve the detection lag?
Please let me know if further clarification is required. I am happy to provide any further details.
To summarize: you have a gigantic filesystem volume (60 TB) which contains a huge number of files, and you use find(1) to name a large number of those files and put those names into a text file for analysis. You have discovered that files are not listed if they are created after find(1) was started but before it finished.
I think the best solution is to stop thinking of this as a batch job, and do it "online" using inotify(7). You can use the inotify API to be immediately informed of changes to your filesystem, including new files being created. There is of course the original C API, as well as the excellent pyinotify.
With inotify, you can start a watcher program once and leave it running continuously (under a supervisor if needed for restarts). The operating system can then notify you whenever a relevant filesystem event occurs, and you can respond immediately rather than waiting for the next scan.
The one downside for your use case might be that the watcher program does need to run on a machine which has the filesystem mounted locally. But the overall compute resources required are probably much less than your current approach of repeated linear scans.
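For a rough idea, here is a minimal pyinotify sketch (the watched path is a placeholder):
import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # a file opened for writing was closed, i.e. the upload is complete
        print('new or updated file:', event.pathname)
    def process_IN_DELETE(self, event):
        print('file removed:', event.pathname)

wm = pyinotify.WatchManager()
mask = pyinotify.IN_CLOSE_WRITE | pyinotify.IN_DELETE
notifier = pyinotify.Notifier(wm, Handler())
# rec=True watches subdirectories too; auto_add=True picks up newly created ones
wm.add_watch('/path/to/share', mask, rec=True, auto_add=True)
notifier.loop()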
Executing find commands and piping the output to temporary files might work up to a certain scale, but it is far from optimal. If you want a less resource-intensive, more reactive solution, I would recommend considering reimplementing your software using the inotify interface:
The inotify API provides a mechanism for monitoring filesystem events. Inotify can be used to monitor individual files, or to monitor directories. When a directory is monitored, inotify will return events for the directory itself, and for files inside the directory.
So an event will be raised for each file change or file addition.
Note that you can then keep an internal list of files up to date which only needs to be changed when you get an event.
We're trying to delete streams for the depot that we no longer need.
In the Graph View I right-clicked and selected delete stream.
The stream was removed from the Graph View, however it is still present in the Depot view.
We want to totally remove it, but we can't seem to do it.
'p4 streams' doesn't know about it.
There are still some files in the stream. I wonder if we need to delete those files first.
Thanks,
- Rich
I wish the people who always answer these posts and say "you don't want to delete your depot!" would understand that people who ask this question are often brand new users who have just made their first depot, made a terrible mess of it, have no idea what they've done and would now like to wipe it and start completely fresh again.
Unless the stream is a task stream, deleting the stream only deletes the stream specification, not the files that have been submitted to the stream.
The files submitted to the stream are part of your permanent history; you want to keep them!
I often find myself referring back to the changelists and diffs of changes made many years ago, sometimes even decades ago, so I do everything possible never to erase old history.
If you really wish to destroy the permanent history of changes made to this stream, you can use the 'p4 obliterate' command, but be aware that this command cannot be undone.
If you are contemplating using obliterate to destroy the files submitted to this stream, you should consider contacting Perforce Technical Support first, as the obliterate command is complicated and has a number of options, and you want to make sure you use the right options. And take a checkpoint first, just for extra protection.
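For reference, obliterate only previews by default and does not delete anything until you add the -y flag (the stream path below is just a placeholder):
p4 obliterate //stream/deadstream/...        # preview only: reports what would be removed
p4 obliterate -y //stream/deadstream/...     # actually removes the file history; cannot be undone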
If you are using streams for temporary work, and frequently find yourself wishing to obliterate that work, consider using task streams.