GetAttributes uses wrong working directory in subthread - multithreading

I used File::Find to traverse a directory tree and Win32::File's GetAttributes function to look at the attributes of files found in it. This worked in a single-threaded program.
Then I moved the directory traversal into a separate thread, and it stopped working. GetAttributes failed on every file with "The system cannot find the file specified" as the error message in $^E.
I traced the problem to the fact that File::Find uses chdir, and apparently GetAttributes doesn't use the current directory. I could work around this by passing it an absolute path, but then I could run into path length limits, and long paths are definitely going to be present where this script will run, so I really need to take advantage of chdir and relative paths.
To demonstrate the problem, here is a script which creates a file in the current directory, another file in a subdirectory, chdir's to the subdirectory, and looks for the file 3 ways: system("dir"), open, and GetAttributes.
When the script is run without arguments, dir shows the subdirectory, open finds the file in the subdirectory, and GetAttributes returns its attributes successfully. When run with --thread, all the tests are done in a subthread, and the dir and open still work, but the GetAttributes fails. Then it calls GetAttributes on the file that is in the original directory (which we have chdir'ed out of) and it finds that one! Somehow GetAttributes is using the original working directory of the process - or maybe the working directory of the main thread - unlike all the other file operations.
How can I fix this? I can guarantee that the main thread won't do any chdir'ing, if that matters.
use strict;
use warnings;
use threads;
use Data::Dumper;
use Win32::File qw/GetAttributes/;
sub doit
{
    chdir("testdir") or die "chdir: $!\n";
    system "dir";
    my $attribs;
    open F, '<', "file.txt" or die "open: $!\n";
    print "open succeeded. File contents:\n-------\n", <F>, "\n--------\n";
    close F;
    my $x = GetAttributes("file.txt", $attribs);
    print Dumper [$x, $attribs, $!, $^E];
    if(!$x) {
        # If we didn't find the file we were supposed to find, how about the
        # bad one?
        $x = GetAttributes("badfile.txt", $attribs);
        if($x) {
            print "GetAttributes found the bad file!\n";
            if(open F, '<', "badfile.txt") {
                print "opened the bad file\n";
                close F;
            } else {
                print "But open didn't open it. Error: $! ($^E)\n";
            }
        }
    }
}
# Setup
-d "testdir" or mkdir "testdir" or die "mkdir testdir: $!\n";
if(!-f "badfile.txt") {
open F, '>', "badfile.txt" or die "create badfile.txt: $!\n";
print F "bad\n";
close F;
}
if(!-f "testdir/file.txt") {
open F, '>', "testdir/file.txt" or die "create testdir/file.txt: $!\n";
print F "hello\n";
close F;
}
# Option 1: do it in the main thread - works fine
if(!(#ARGV && $ARGV[0] eq '--thread')) {
doit();
}
# Option 2: do it in a secondary thread - GetAttributes fails
if(#ARGV && $ARGV[0] eq '--thread') {
my $thr = threads->create(\&doit);
$thr->join();
}

Eventually, I figured out that Perl maintains a secondary cwd of its own that only applies to Perl built-in operators, while GetAttributes uses the process's native cwd. I don't know why it does this or why it only shows up in the secondary thread; my best guess is that Perl is trying to emulate the Unix rule of one cwd per process, and the Win32::* modules don't play along.
Whatever the reason, it's possible to work around it by forcing the native cwd to match Perl's cwd whenever you're about to do a Win32::* operation, like this:
use Cwd;
use Win32::FindFile qw/SetCurrentDirectory/;
...
SetCurrentDirectory(getcwd());
Arguably File::Find should do this when running on Win32.
Of course this only makes the "pathname too long" problem worse, because now every directory you visit will be the target of an absolute-path SetCurrentDirectory; try to work around it with a series of smaller SetCurrentDirectory calls and you have to figure out a way to get back where you came from, which is hard when you don't even have fchdir.
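For illustration, here is a minimal sketch of that workaround applied inside a File::Find callback, using the same modules mentioned above (the directory name testdir and the attribute printing are just example details, not from the original code):
use strict;
use warnings;
use Cwd;
use File::Find;
use Win32::File qw/GetAttributes/;
use Win32::FindFile qw/SetCurrentDirectory/;

find(sub {
    # File::Find has already chdir'ed into the directory being visited;
    # sync the native Win32 cwd with Perl's cwd before any Win32::* call.
    SetCurrentDirectory(getcwd());

    my $attribs;
    if (GetAttributes($_, $attribs)) {
        print "$File::Find::name: attributes = $attribs\n";
    } else {
        warn "GetAttributes($File::Find::name) failed: $^E\n";
    }
}, "testdir");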

Related

Printing all files in a directory in Perl - will not work

So I am new to Perl and trying to simply open a directory, and list all its files. When I run this very simple code below trying to print everything in /usr/bin it will not work, and no matter what I try I keep getting told 'Could not open /usr/bin: No such file or directory'.
Any help would be much appreciated!
#!/usr/bin/perl
$indir = "/usr/bin";
# read in all files from the directory
opendir (DIR, @indir) or die "Could not open $indir: $!\n";
while ($filename = readdir(DIR)) {
    print "$filename\n";
}
closedir(DIR);
Here is another place where the very basic troubleshooting step of adding use strict; and use warnings; has been omitted; it would have told you exactly what was wrong:
Global symbol "@indir" requires explicit package name (did you forget to declare "my @indir"?)
Of course, you'd also have to fix a few other errors (e.g. declare the variable with my $indir = '/usr/bin'; and pass $indir, not @indir, to opendir).
I would also suggest that readdir is not well suited for this job, and would tend to recommend glob:
#!/usr/bin/env perl
use strict;
use warnings;
my $indir = "/usr/bin";
# read in all files from the directory
foreach my $filename ( glob "$indir/*" ) {
    print "$filename\n";
}
Note how this differs - it prints a full path to the file, and it omits certain things (like . and ..), which is, in my opinion, more generally useful. Not least because another really common error is to write open my $fh, '<', $filename or die $!, forgetting that the file is not in the current working directory.
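For comparison, a readdir-based version would have to skip the dot entries and build the full path itself; a rough sketch:
#!/usr/bin/env perl
use strict;
use warnings;

my $indir = "/usr/bin";

opendir my $dh, $indir or die "Could not open $indir: $!\n";
while ( my $entry = readdir $dh ) {
    next if $entry eq '.' || $entry eq '..';   # readdir includes the dot entries
    print "$indir/$entry\n";                   # prepend the directory yourself
}
closedir $dh;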

the file says it's there, but it's not really there at all

What condition could cause this?
a dentry for the file exists
-e says the file is there
cat says "No such file or directory"
opendir(my $dh, $tdir) or die "Could not open temp dir: $!";
for my $file (readdir($dh)) {
    print STDERR "checking file $tdir/$file\n";
    next unless -e "$tdir/$file";
    print STDERR "here's the file for $tdir/$file: ";
    print STDERR `cat $tdir/$file`;
    die "Not all files from hub '$hid' have been collected!: $tdir/$file";
}
Output:
checking file
/tmp/test2~22363~0G0Tjv/22363~0~4~22380~0~1~Test2~Event~Note
here's the file for /tmp/test2~22363~0G0Tjv/22363~0~4~22380~0~1~Test2~Event~Note:
cat: /tmp/test2~22363~0G0Tjv/22363~0~4~22380~0~1~Test2~Event~Note: No such file or directory
IPC Fatal Error: Not all files from hub '22363~0~4' have been collected!:
I realize that theoretically some other process or thread could be stepping on my file in between cycles, but this is Perl (single-threaded) and Test2/Test::Simple's test directory, which should be limited to being managed by the master process.
What's brought me here is that I've seen similar errors in other code we have:
if(-e $cachepath) {
    unlink $cachepath
        or Carp::carp("cannot unlink...
which throws "cannot unlink..."
also
$cache_mtime = (stat($cachepath))[_ST_MTIME];
....
if ($cache_mtime){
    open my($in), '<:raw', $cachepath
        or $self->_error("LoadError: Cannot open $cachepath for reading:
which throws that "LoadError: cannot open..." exception
Those are also cases where only one process should be operating on the directory, and all of that makes me suspect something else is going on.
This is a Linux kernel, fully up to date with vendor patches.
It turns out it really was child processes and race conditions. The newer code in Test::Simple doesn't work well with nested subtests. Anyway, thanks for allowing me to rule out some obscure filesystem thing that I didn't know about.
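For what it's worth, the quoted unlink pattern can be made robust against that kind of race by dropping the -e check and inspecting the error from the operation itself; a general sketch (not the actual Test::Simple fix, and $cachepath here is just a placeholder):
use strict;
use warnings;
use Errno;
use Carp;

my $cachepath = "/tmp/example.cache";   # placeholder path

# Testing -e and then acting on the result is a race: another process can
# remove the file in between. Just attempt the unlink and look at the error.
if (!unlink $cachepath) {
    if ($!{ENOENT}) {
        # someone else already removed it; nothing to do
    } else {
        Carp::carp("cannot unlink $cachepath: $!");
    }
}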

Relinking an anonymous (unlinked but open) file

In Unix, it's possible to create a handle to an anonymous file by, e.g., creating and opening it with creat() and then removing the directory link with unlink() - leaving you with a file with an inode and storage but no possible way to re-open it. Such files are often used as temp files (and typically this is what tmpfile() returns to you).
My question: is there any way to re-attach a file like this back into the directory structure? If you could do this it means that you could e.g. implement file writes so that the file appears atomically and fully formed. This appeals to my compulsive neatness. ;)
When poking through the relevant system call functions I expected to find a version of link() called flink() (compare with chmod()/fchmod()), but at least on Linux this doesn't exist.
Bonus points for telling me how to create the anonymous file without briefly exposing a filename in the disk's directory structure.
A patch for a proposed Linux flink() system call was submitted several years ago, but when Linus stated "there is no way in HELL we can do this securely without major other incursions", that pretty much ended the debate on whether to add this.
Update: As of Linux 3.11, it is now possible to create a file with no directory entry using open() with the new O_TMPFILE flag, and link it into the filesystem once it is fully formed using linkat() on /proc/self/fd/fd with the AT_SYMLINK_FOLLOW flag.
The following example is provided on the open() manual page:
char path[PATH_MAX];
fd = open("/path/to/dir", O_TMPFILE | O_RDWR, S_IRUSR | S_IWUSR);
/* File I/O on 'fd'... */
snprintf(path, PATH_MAX, "/proc/self/fd/%d", fd);
linkat(AT_FDCWD, path, AT_FDCWD, "/path/for/file", AT_SYMLINK_FOLLOW);
Note that linkat() will not allow open files to be re-attached after the last link is removed with unlink().
My question: is there any way to re-attach a file like this back into the directory structure? If you could do this it means that you could e.g. implement file writes so that the file appears atomically and fully formed. This appeals to my compulsive neatness. ;)
If this is your only goal, you can achieve this in a much simpler and more widely used manner. If you are outputting to a.dat:
Open a.dat.part for write.
Write your data.
Rename a.dat.part to a.dat.
I can understand wanting to be neat, but unlinking a file and relinking it just to be "neat" is kind of silly.
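A minimal sketch of that write-then-rename pattern in Perl (the file names are just placeholders):
use strict;
use warnings;

my $final = "a.dat";
my $tmp   = "$final.part";

open my $out, '>', $tmp or die "open $tmp: $!";
print {$out} "fully formed contents\n";
close $out or die "close $tmp: $!";

# rename() is atomic within a filesystem: readers see either the old a.dat
# or the complete new one, never a half-written file.
rename $tmp, $final or die "rename $tmp -> $final: $!";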
This question on serverfault seems to indicate that this kind of re-linking is unsafe and not supported.
Thanks to @mark4o for posting about linkat(2); see his answer for details.
I wanted to give it a try to see what actually happens when you try to link an anonymous file back into the filesystem it is stored on (often /tmp, e.g. for video data that Firefox is playing).
As of Linux 3.16, there still appears to be no way to undelete a deleted file that's still held open. Neither AT_SYMLINK_FOLLOW nor AT_EMPTY_PATH for linkat(2) do the trick for deleted files that used to have a name, even as root.
The only alternative is tail -c +1 -f /proc/19044/fd/1 > data.recov, which makes a separate copy, and you have to kill it manually when it's done.
Here's the perl wrapper I cooked up for testing. Use strace -eopen,linkat linkat.pl - </proc/.../fd/123 newname to verify that your system still can't undelete open files. (Same applies even with sudo). Obviously you should read code you find on the Internet before running it, or use a sandboxed account.
#!/usr/bin/perl -w
# 2015 Peter Cordes <peter@cordes.ca>
# public domain. If it breaks, you get to keep both pieces. Share and enjoy
# Linux-only linkat(2) wrapper (opens "." to get a directory FD for relative paths)
if ($#ARGV != 1) {
    print "wrong number of args. Usage:\n";
    print "linkat old new \t# will use AT_SYMLINK_FOLLOW\n";
    print "linkat - <old new\t# to use the AT_EMPTY_PATH flag (requires root, and still doesn't re-link arbitrary files)\n";
    exit(1);
}
# use POSIX qw(linkat AT_EMPTY_PATH AT_SYMLINK_FOLLOW); #nope, not even POSIX linkat is there
require 'syscall.ph';
use Errno;
# /usr/include/linux/fcntl.h
# #define AT_SYMLINK_NOFOLLOW 0x100 /* Do not follow symbolic links. */
# #define AT_SYMLINK_FOLLOW 0x400 /* Follow symbolic links. */
# #define AT_EMPTY_PATH 0x1000 /* Allow empty relative pathname */
unless (defined &AT_SYMLINK_NOFOLLOW) { sub AT_SYMLINK_NOFOLLOW() { 0x0100 } }
unless (defined &AT_SYMLINK_FOLLOW ) { sub AT_SYMLINK_FOLLOW () { 0x0400 } }
unless (defined &AT_EMPTY_PATH ) { sub AT_EMPTY_PATH () { 0x1000 } }
sub my_linkat ($$$$$) {
    # tmp copies: perl doesn't know that the string args won't be modified.
    my ($oldp, $newp, $flags) = ($_[1], $_[3], $_[4]);
    return !syscall(&SYS_linkat, fileno($_[0]), $oldp, fileno($_[2]), $newp, $flags);
}

sub linkat_dotpaths ($$$) {
    open(DOTFD, ".") or die "open . $!";
    my $ret = my_linkat(DOTFD, $_[0], DOTFD, $_[1], $_[2]);
    close DOTFD;
    return $ret;
}

sub link_stdin ($) {
    my ($newp, ) = @_;
    open(DOTFD, ".") or die "open . $!";
    my $ret = my_linkat(0, "", DOTFD, $newp, &AT_EMPTY_PATH);
    close DOTFD;
    return $ret;
}

sub linkat_follow_dotpaths ($$) {
    return linkat_dotpaths($_[0], $_[1], &AT_SYMLINK_FOLLOW);
}
## main
my $oldp = $ARGV[0];
my $newp = $ARGV[1];
# link($oldp, $newp) or die "$!";
# my_linkat(fileno(DIRFD), $oldp, fileno(DIRFD), $newp, AT_SYMLINK_FOLLOW) or die "$!";
if ($oldp eq '-') {
    print "linking stdin to '$newp'. You will get ENOENT without root (or CAP_DAC_READ_SEARCH). Even then doesn't work when links=0\n";
    $ret = link_stdin( $newp );
} else {
    $ret = linkat_follow_dotpaths($oldp, $newp);
}
# either way, you still can't re-link deleted files (tested Linux 3.16 and 4.2).
# print STDERR
die "error: linkat: $!.\n" . ($!{ENOENT} ? "ENOENT is the error you get when trying to re-link a deleted file\n" : '') unless $ret;
# if you want to see exactly what happened, run
# strace -eopen,linkat linkat.pl
Clearly, this is possible -- fsck does it, for example. However, fsck does it with major localized file system mojo and will clearly not be portable, nor executable as an unprivileged user. It's similar to the debugfs comment above.
Writing that flink(2) call would be an interesting exercise. As ijw points out, it would offer some advantages over current practice of temporary file renaming (rename, note, is guaranteed atomic).
Kind of late to the game but I just found http://computer-forensics.sans.org/blog/2009/01/27/recovering-open-but-unlinked-file-data which may answer the question. I haven't tested it, though, so YMMV. It looks sound.

How can I change the current directory in a thread-safe manner in Perl?

I'm using Thread::Pool::Simple to create a few working threads. Each working thread does some stuff, including a call to chdir followed by an execution of an external Perl script (from the jbrowse genome browser, if it matters). I use capturex to call the external script and die on its failure.
I discovered that when I use more than one thread, things start to get messy. After some research, it seems that the current directory of some threads is not the correct one.
Perhaps chdir propagates between threads (i.e. isn't thread-safe)?
Or perhaps it's something with capturex?
So, how can I safely set the working directory for each thread?
** UPDATE **
Following the suggestions to change the directory while executing, I'd like to ask how exactly I should pass these two commands to capturex.
currently I have:
my @args = ( "bin/flatfile-to-json.pl", "--gff=$gff_file", "--tracklabel=$track_label", "--key=$key", @optional_args );
capturex( [0], @args );
How do I add another command to @args?
Will capturex continue to die on errors from any of the commands?
I think that you can solve your "how do I chdir in the child before running the command" problem pretty easily by abandoning IPC::System::Simple as not the right tool for the job.
Instead of doing
my $output = capturex($cmd, @args);
do something like:
use autodie qw(open close);
my $pid = open my $fh, '-|';
unless ($pid) { # this is the child
    chdir($wherever);
    exec($cmd, @args) or exit 255;
}
my $output = do { local $/; <$fh> };
# If child exited with error or couldn't be run, the exception will
# be raised here (via autodie; feel free to replace it with
# your own handling)
close ($fh);
If you were getting a list of lines instead of scalar output from capturex, the only thing that needs to change is the second-to-last line (to my #output = <$fh>;).
More info on forking-open is in perldoc perlipc.
The good thing about this in preference to capture("chdir wherever ; $cmd @args") is that it doesn't give the shell a chance to do bad things to your @args.
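To make that concrete, a contrived sketch (the shell-hostile file name is made up):
use strict;
use warnings;

my $cmd  = "wc";
my @args = ("-l", "my file; rm -rf ~");   # hypothetical, shell-hostile name

# List form: each element is passed to the program as one argument,
# untouched, and no shell is involved.
system($cmd, @args);

# Single-string form: the shell splits on whitespace and treats the part
# after ';' as a second command. Don't do this:
# system("cd /some/dir ; $cmd @args");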
Updated code (doesn't capture output)
my $pid = fork;
die "Couldn't fork: $!" unless defined $pid;
unless ($pid) { # this is the child
    chdir($wherever);
    open STDOUT, ">/dev/null"; # optional: silence subprocess output
    open STDERR, ">/dev/null"; # even more optional
    exec($cmd, @args) or exit 255;
}
wait;
die "Child error $?" if $?;
I don't think "current working directory" is a per-thread property. I'd expect it to be a property of the process.
It's not clear exactly why you need to use chdir at all though. Can you not launch the external script setting the new process's working directory appropriately instead? That sounds like a more feasible approach.

"Copy failed: File too large" error in perl

OK, so I have 6.5 million images in a folder and I need to get them moved ASAP. I will be moving them into their own folder structure, but first I must get them moved off this server.
I tried rsync and cp and all sorts of other tools, but they always end up erroring out. So I wrote a Perl script to pull the information in a more direct way. Using opendir and having it count all the files works perfectly; it can count them all in about 10 seconds. Now I try to step my script up one more notch and have it actually move the files, and I get the error "File too large". This must be some sort of false error, as the files themselves are all fairly small.
#!/usr/bin/perl
#############################################
# CopyFilesLite
# Russell Perkins
# 7/12/2010
#
# Tool is used to copy millions of files
# while using as little memory as possible.
#############################################
use strict;
use warnings;
use File::Copy;
#dir1, dir2 passed from command line
my $dir1 = shift;
my $dir2 = shift;
#Varibles to keep count of things
my $count = 0;
my $cnt_FileExsists = 0;
my $cnt_FileCopied = 0;
#simple error checking and validation
die "Usage: $0 directory1 directory2\n" unless defined $dir2;
die "Not a directory: $dir1\n" unless -d $dir1;
die "Not a directory: $dir2\n" unless -d $dir2;
opendir DIR, "$dir1" or die "Could not open $dir1: $!\n";
while (my $file = readdir DIR){
    if (-e $dir2 .'/' . $file){
        #print $file . " exsists in " . $dir2 . "\n"; #debuging
        $cnt_FileExsists++;
    }else{
        copy($dir1 . '/' . $file,$dir2 . '/' . $file) or die "Copy failed: $!";
        $cnt_FileCopied++;
        #print $file . " does not exsists in " . $dir2 . "\n"; #debuging
    }
    $count++;
}
closedir DIR;
#ToDo: Clean up output.
print "Total files: $count\nFiles not copied: $cnt_FileExsists\nFiles Copied: $cnt_FileCopied\n\n";
So, have any of you run into this before? What would cause this, and how can it be fixed?
In your error handling code, could you please change
or die "Copy failed: $!";
to the following?
or die "Copy failed: '$dir1/$file' to '$dir2/$file': $!";
Then it should tell you where the error happens.
Then check 2 things -
1) Does it fail every time on the same file?
2) Is that file somehow special? Weird name? Unusual size? Not a regular file? Not a file at all (as the other answer theorized)?
I am not sure if this is related to your problem, but readdir will return a list of all directory contents, including subdirectories, if present, and the current (.) and parent directories (..) on many operating systems. You may be attempting to copy directories as well as files.
The following will not attempt to copy any directories:
while (my $file = readdir DIR){
    next if -d "$dir1/$file";
It seems this was an issue either with my NFS mount or with the server it was mounted to. I hooked up a USB drive to it and the files are copying with extreme speed... if you count USB 2 as extreme.
6.5 million images in one folder is very extreme and puts a load on the machine just to read a directory, whether it's in shell or Perl. That's one big folder structure.
I know you're chasing a solution in Perl now, but when dealing with that many files from the shell you'll want to take advantage of the xargs command. It can help a lot by grouping the files into manageable chunks. http://en.wikipedia.org/wiki/Xargs
Maybe the filesystem of the partition you are sending the data to does not support very large amounts of data.
