Windows/Python check if file is open or in use - python-3.x

I am using Python to monitor a folder and check whether files are being copied in, and if so, replicate them to a new location.
I am using the following to monitor the folder:
fsmonitor
The issue I am facing is that I cannot tell whether a file is in use and still in the process of being written to disk. If so, I want to wait until the copy is complete and only then start copying it to my new location.
So how do I find out if a file is in use/open?
I have seen some suggestions here where I try to write to the file in question, and if that fails it indicates that the file is in use:
example answer (I've seen similar in python)
But I am reluctant to use such a method for fear that it might cause corruption and similar issues.
Is there an alternative/safer way to do this? Or is testing write permissions safe?
Is anyone familiar with pywin32? Does it provide such tools? The site looks arcane, so I wonder if it exposes the latest APIs provided by Windows; even fsmonitor mentioned above uses the same library, and I wonder if there are newer/more efficient ways to do this.
Currently, I am using psutil with proc.open_files() to loop through all processes and all files to list out open files. If files that I am concerned about appear on this list, I wait and try again. However, this process creates a humongous list of files and uses 12% of my CPU to create it, so I desperately need an alternative.
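For reference, here is roughly what that psutil check looks like (a simplified sketch; in the real script the watched path comes from fsmonitor):

import psutil

def is_file_open(path):
    # Walk every process and every handle it has open and see whether
    # our file shows up anywhere - this is the part that eats CPU.
    for proc in psutil.process_iter():
        try:
            for handle in proc.open_files():
                if handle.path == path:
                    return True
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue
    return False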
In response to Adrian McCarthy
I started out assuming that it is safe to act on whatever fsmonitor puts out, but look at the following output, which is for a single file copy:
0 86 0
create C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe 3684bf38
create C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe 3684bf38
0 86 0
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe a8cf3250
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe a8cf3250
0 160 0
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe caef5c64
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64.exe caef5c64
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64 - Copy.exe caef5c64
modify C:\Users\ScanUser\Pictures\syncTest dotnet-sdk-5.0.203-win-x64.exe caef5c64
So the conundrum is: at which 'modify' do I start copying the file? I can wait a few seconds/minutes to see if another 'modify' appears for that file, but how do I decide how long to wait? A large file over SFTP may take 30 minutes to copy, so I need something scalable.
Also, I would like to avoid making multiple copy actions for a file, since that would make the script inefficient.
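To illustrate the kind of wait-and-see logic I mean, here is a rough sketch (the quiet period is an arbitrary guess, which is exactly the problem - it does not scale to a 30-minute SFTP transfer):

import time

QUIET_PERIOD = 5  # seconds without a new event before we call the file "settled"
last_event = {}   # path -> timestamp of the most recent fsmonitor event

def note_event(path):
    last_event[path] = time.monotonic()

def ready_to_copy(path):
    # Only copy once no create/modify event has arrived for QUIET_PERIOD seconds.
    return path in last_event and time.monotonic() - last_event[path] > QUIET_PERIOD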

This can help you
check if a file is open in Python
Here is the code:
try:
    # try to open the file
    with open("file", "r") as file:
        # some code here
        pass
except IOError:
    # if open() throws an error, the file is in use
    pass

I think you're unnecessarily concerned about working with the file while another process still has it open.
On Windows, fsmonitor uses the ReadDirectoryChangesW mechanism. That means you'll get a notification about a change after it happens. So if a process writes to foo.log, you'll get a notification after the write operation is completed. (In fact, I think it's after the update of the directory metadata.)
To copy the file, you need read access. So just go ahead and open it for reading.
If it opens, then it's safe to read, even if another process has it open. You cannot corrupt a file by reading it even if another process is writing to it.
If it fails to open, then another process has it open and is intentionally preventing other processes from reading it (probably because they know they'll be actively updating it). In that case, you can try again later.
Trying to first check whether another process is using the file doesn't actually help because the answer could change between the moment you check and the moment you try to act on that information.
When you open a file, the system does the permission check and the opening under a mutex*, so the answer cannot change in between. There's no way for you to simulate that yourself from user-mode code. Once you have the file open, you can safely use it.
If you try to read from a file at the same moment another process tries to write to it, the system will ensure that the read will get the data as it was before the write or as it is after the write. It won't get a result that's a mixture of old and new.
That said, if you're reading the file with a bunch of small read operations while another process is writing to it with a bunch of small write operations, it's possible you might capture some intermediate state of the file. But that's okay. The original file is unharmed, and those writes will trigger another fsmonitor notification, so your code will start over and make another copy of the file.
* I'm using "mutex" in a generic sense: It uses some sort of synchronization mechanism, but it might not necessarily be a Windows Mutex object.

Related

Node.js read and write stream to the same file at the same time

TL;DR
I'm browsing through a number of solutions on npm and github looking for something that would allow me to read and write to the same file in two different places at the same time. So far I'm having trouble actually finding anything like this. Is there a module of some sort that will allow that?
Background
In essence my requirement is that in a large file I need to, in the following order:
read
transform
write
Ideally the usage would be something like:
const fd = fs.openSync(file, "r+");
const read = createReadStreamSomehowFrom(fd);
const write = createWriteStreamSomehowFrom(fd);
read
  .pipe(new Transform({ transform(chunk, encoding, callback) { /* ... */ } }))
  .pipe(write);
I could do that with the standard fs.create[Read/Write]Stream, but there's no way to control the flow of both streams, and if my write position goes beyond the read position then I'm reading something I just wrote...
The use case is the same as perl -p -i -e, read and write to the same file (meaning the same inode) asynchronously and replace the contents without loading everything into memory.
I would expect this to be a real-world use case, yet all implementations I found actually load the whole file into memory and then save it. Am I missing a known module here, or is there a need to actually write something like this?
Hmm... a tough one it seems. :)
So, for the record - I found no such module and actually discussed this with some people responsible for a nice in-file replacing module. Seeing no way to solve this, I decided to write it from scratch, and here it is:
signicode/rw-stream repo on github
rw-stream at npm
The module works on a simple principle: no byte can be written until it has been consumed by the readable stream. It's fairly simple underneath (a couple of fs.read/write ops while keeping an eye on the read and write positions).
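For illustration only, the same idea sketched in Python (this assumes the transform never returns more bytes than it consumed; rw-stream itself is Node.js and handles this more generally):

def transform_in_place(path, transform, chunk_size=64 * 1024):
    # Never write a byte that the read side has not already consumed.
    read_pos = 0
    write_pos = 0
    with open(path, "r+b") as f:
        while True:
            f.seek(read_pos)
            chunk = f.read(chunk_size)
            if not chunk:
                break
            read_pos = f.tell()
            out = transform(chunk)
            assert write_pos + len(out) <= read_pos
            f.seek(write_pos)
            f.write(out)
            write_pos += len(out)
        f.truncate(write_pos)  # drop any leftover tail if the output shrank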
If you find this useful then I'm happy. :)

Is there a way to have a ioctl() with new(customized) command

I am working on a testing tool for nvme-cli (written in C, runs on Linux).
For SSD validation purposes, I was actually looking for a custom command (e.g. an I/O command that writes, then reads the same data back, and finally compares whether both sets of data match).
In user space I need to invoke a minimum of two ioctl() calls, one with the write command (nvme_cmd_write) and another with the read command (nvme_cmd_read), and compare both buffer contents.
The issue arises when I want to send these commands in parallel: at the block level (using ioctl()) we were not able to put them into different I/O submission queues.
So can we have a custom command (nvme_cmd_write_compare) sent via ioctl(), with a new module at the driver level to handle this new command?
Since I am new to NVMe/ioctl(), please correct me if there are any mistakes.
I want to know whether we can implement this.

Stream definition: Ignore all files but one filetype

We have a server with a depot that does not allow committing files which are in a client mapping, therefore I need a stream configuration.
Now I struggle with a task which I would assume should be simple:
We have a very large stream with lots of different file types and I would like to check out the entire stream but get only a certain file type.
Can this be done with Perforce without blacklisting every file type in question?
Edit: Sorry that I (for some reason) omitted so much information in my question.
I am already setting up a virtual stream where the UI gives me three nice fields:
Paths – where I can enter import, share, and isolate paths
Remapping – ignored in my case
Ignored – here I can enter wildcards to ignore directories or files
I was hoping that by creating a virtual stream I actually could define the file types I want, e.g. I could write an import statement like
import RootDir/....txt //Depot/mainline/RootDir/....txt (note the 4 dots: 3 for Perforce and the other as a "wildcard")
however the stream definition does not support this and only allows me to write
import RootDir/... //Depot/mainline/RootDir/...
Since I was not able to find a way to whitelist the files I wanted, the only option I knew of was to blacklist everything I did not want, but I would like to avoid that because my Ignored list would be dozens of entries long.
Now I will look into that sync hint because I could use the full stream spec without filter and only sync the files I need on disk, which might be very good.
There are a few different things going on in your question, but this seems the most like a statement of what you're trying to do, so I'm going to zero in on it:
I would like to check out the entire stream but get only a certain file type.
If by "check out" you mean you only want to sync that file type to your local workspace:
p4 sync ....TXT
If by "check out" you mean you want to open only that file type for edit:
p4 edit ....TXT
ANY operation in Perforce that operates on files accepts an arbitrary file path, because Perforce tracks all of its state per-file. This is true whether you're using classic clients or streams.
There needs to be some mechanism for telling the Helix (Perforce) server that you only want to retrieve certain files from the stream.
Virtual Streams may be a good fit here, as they allow you to filter the view of an existing stream.
This means you can sync only the files you want and when you submit you will be submitting directly back to the stream your virtual stream is based on.
More information is available here:
https://www.perforce.com/perforce/doc.current/manuals/p4v/p4v_virtual_streams.html

How to close file resources in pyglet

This is definitely a repeat of this question, but that one has gotten 0 replies in 3 months and I can't seem to find an answer. The question ought to be simple: once you're done with a file (say, a video or a sound) in pyglet, how do you go about closing that file? I have an application which has to iterate over a few hundred thousand files, processing each one in turn. For obvious reasons, I am getting OSError: Too many open files. Is there a way to force-close pyglet's files?
For sound or video files you can close the current source with the Player delete() method.
You can also load the resource directly into memory instead of streaming from disk by setting the streaming argument to False in the load call:
pyglet.media.load(filename, streaming=False)
If all else fails, try forcing the garbage collector with gc.collect() (after import gc).
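Putting those together, a rough sketch for iterating over many files (the file list and the processing step are placeholders):

import gc
import pyglet

def process_files(paths):
    for path in paths:
        source = pyglet.media.load(path, streaming=False)  # decode fully into memory
        player = pyglet.media.Player()
        player.queue(source)
        # ... do whatever processing you need here ...
        player.delete()  # close the current source and release its resources
        del source
        gc.collect()     # last resort if handles still pile up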

Syncing two files when one is still being written to

I have an application (video stream capture) which constantly writes its data to a single file. The application typically runs for several hours, creating a ~1 gigabyte file. Soon (a matter of several seconds) after it quits, I'd like to have 2 copies of the file it was writing - let's say, one in /mnt/disk1, another in /mnt/disk2 (the latter is a USB flash drive with a FAT32 filesystem).
I don't really like the idea of modifying the application to write 2 copies simultaneously, so I thought of:
Application starts and begins to write the file (let's call it /mnt/disk1/file.mkv)
Some utility starts, copies what's already there in /mnt/disk1/file.mkv to /mnt/disk2/file.mkv
After getting to the initial sync state, it continues to follow the file being written, in a manner like tail -f does, copying everything it gets from /mnt/disk1/file.mkv to /mnt/disk2/file.mkv
Several hours pass
Application quits, we stop our syncing utility
Afterwards, we run a quick rsync /mnt/disk1/file.mkv /mnt/disk2/file.mkv just to make sure they're the same. If they are, it should just run a quick check and quit fairly soon.
What is the best approach for syncing 2 files, preferably using simple Linux shell-available utilities? Maybe I could use some clever trick with FUSE / an md device / tee / tail -f?
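Here is roughly what I had in mind for steps 2-3, as a sketch (paths, poll interval and the stop condition are made up):

import time

def follow_copy(src, dst, poll=1.0, chunk=1 << 20, stop=lambda: False):
    # Copy what is already in src, then keep appending new data to dst as
    # src grows, tail -f style, until stop() reports that the writer quit.
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            data = fin.read(chunk)
            if data:
                fout.write(data)
            elif stop():
                break
            else:
                time.sleep(poll)  # writer still running; wait for more data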
Solution
The best possible solution for my case seems to be
mencoder ... -o >(
tee /mnt/disk1/file.mkv |
tee /mnt/disk2/file.mkv |
mplayer -
)
This one uses bash/zsh-specific magic called "process substitution", thus eliminating the need to make named pipes manually using mkfifo, and as a bonus it displays what's being encoded :)
Hmmm... the file is not usable while it's being written anyway, so why don't you "trick" your program into writing through a pipe/fifo and use a second, very simple program to create 2 copies?
This way, you have your two copies as soon as the original process ends.
Read the manual page on tee(1).
