OK, so I have 6.5 million images in a folder and I need to get them moved ASAP. I will be moving them into their own folder structure, but first I must get them moved off this server.
I tried rsync and cp and all sorts of other tools, but they always end up erroring out. So I wrote a Perl script to pull the information in a more direct way. Using opendir and having it count all the files works perfectly; it can count them all in about 10 seconds. Now I try to step my script up one more notch and have it actually move the files, and I get the error "File too large". This must be some sort of false error, as the files themselves are all fairly small.
#!/usr/bin/perl
#############################################
# CopyFilesLite
# Russell Perkins
# 7/12/2010
#
# Tool is used to copy millions of files
# while using as little memory as possible.
#############################################
use strict;
use warnings;
use File::Copy;
#dir1, dir2 passed from command line
my $dir1 = shift;
my $dir2 = shift;
#Variables to keep count of things
my $count = 0;
my $cnt_FileExists = 0;
my $cnt_FileCopied = 0;
#simple error checking and validation
die "Usage: $0 directory1 directory2\n" unless defined $dir2;
die "Not a directory: $dir1\n" unless -d $dir1;
die "Not a directory: $dir2\n" unless -d $dir2;
opendir DIR, "$dir1" or die "Could not open $dir1: $!\n";
while (my $file = readdir DIR){
    if (-e $dir2 . '/' . $file){
        #print $file . " exists in " . $dir2 . "\n"; #debugging
        $cnt_FileExists++;
    }else{
        copy($dir1 . '/' . $file, $dir2 . '/' . $file) or die "Copy failed: $!";
        $cnt_FileCopied++;
        #print $file . " does not exist in " . $dir2 . "\n"; #debugging
    }
    $count++;
}
closedir DIR;
#ToDo: Clean up output.
print "Total files: $count\nFiles not copied: $cnt_FileExsists\nFiles Copied: $cnt_FileCopied\n\n";
So, have any of you run into this before? What would cause this, and how can it be fixed?
On your error handling code, could you please change

or die "Copy failed: $!";

to

or die "Copy failed: '$dir1/$file' to '$dir2/$file': $!";

Then it should tell you where the error happens.
Then check 2 things -
1) Does it fail every time on the same file?
2) Is that file somehow special? Weird name? Unusual size? Not a regular file? Not a file at all (as the other answer theorized)?
I am not sure if this is related to your problem, but readdir returns all directory entries, including any subdirectories and, on most operating systems, the current (.) and parent (..) directories. You may be attempting to copy directories as well as files.
The following will not attempt to copy any directories:
while (my $file = readdir DIR){
    next if -d "$dir1/$file";   # skips . , .. and any subdirectories
Seems this was an issue either with my NFS mount or with the server it was mounted to. I hooked up a USB drive to it and the files are copying with extreme speed... if you count USB 2.0 as extreme.
6.5 million images in one folder is very extreme and puts a load on the machine just to read a directory, whether it's in shell or Perl. That's one big folder structure.
I know you're chasing a solution in Perl now, but when dealing with that many files from the shell you'll want to take advantage of the xargs command. It can help a lot by grouping the files into manageable chunks. http://en.wikipedia.org/wiki/Xargs
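For instance, something along these lines copies the files over in batches of 1000 instead of handing all 6.5 million names to a single command. This is only a rough sketch, assuming GNU find, xargs, and cp; /src and /dest are placeholder paths:

# /src and /dest are placeholders; -n 1000 caps the arguments per cp invocation
find /src -maxdepth 1 -type f -print0 | xargs -0 -n 1000 cp -t /dest

The -print0 / -0 pair keeps unusual file names safe, which matters when you have millions of them.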
Maybe the filesystem of the partition you are sending the data to does not support very large amounts of data.
How to delete (remove | trim) N bytes from the beginning of a binary file without loading it into memory?
We have fs.ftruncate(fd, len, callback), which cuts bytes off the end of the file (if it is bigger than len).
How can I cut bytes from the beginning, or trim from the beginning, in Node.js without reading the file into memory?
I need something like truncateFromBeginning(fd, len, callback) or removeBytes(fd, 0, N, callback).
If it is not possible, what is the fastest way to do it with file streams?
On most filesystems you can't "cut" a part out from the beginning or from the middle of a file, you can only truncate it at the end.
With the above in mind, I imagine we have to open the input file stream, seek to just after the Nth byte, and pipe the rest of the bytes to an output file stream.
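From the shell, the same idea can be sketched with tail, whose -c +K option starts output at byte K (1-based). This is just an illustration; in.bin and out.bin are placeholder names and N is the number of bytes to drop:

N=1024                                   # placeholder: bytes to cut from the front
tail -c +"$((N + 1))" in.bin > out.bin   # keep everything after the first N bytes
mv out.bin in.bin                        # replace the original with the trimmed copy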
You're asking for an OS file system operation: the ability to remove some bytes from the beginning of a file in place, without rewriting the file.
You're asking for a file system operation that does not exist, at least in Linux / FreeBSD / MacOS / Windows.
If your program is the only user of the file and it fits in RAM, your best bet is to read the whole thing into RAM, then reopen the file for writing, then write out the part you want to keep.
Or you can create a new file. Let's say your input file is called q. Then you'd create a file called, maybe new_q with a stream attached. You'd pipe the contents you wanted to the new file. Then you'd unlink (delete) the input file q and rename the output file new_q to q.
Careful: this unlink / rename operation creates a brief window during which no file named q is available. So if some other program tries to open it and doesn't find it, it should retry a few times.
If you're creating a queueing scheme, you might consider using some other scheme to hold your queue data. This file read / rewrite / unlink / rename sequence has lots of ways it can go wrong on you under heavy load. (Ask me how I know that when you have a couple of hours to spare ;-) redis is worth a look.
I decided to solve the problem in bash.
The script truncates the files into a temp folder first, then moves them back to the original folder.
The truncate is done with tail:
tail --bytes="$max_size" "$from_file" > "$to_file"
The full script:
#!/bin/bash
declare -r store="/my/data/store"
declare -r temp="/my/data/temp"
declare -r max_size=$(( 200000 * 24 ))
or_exit() {
    local exit_status=$?
    local message=$*
    if [ $exit_status -gt 0 ]
    then
        echo "$(date '+%F %T') [$(basename "$0" .sh)] [ERROR] $message" >&2
        exit $exit_status
    fi
}
# Checks if there are any files in 'temp'. It should be empty.
! ls "$temp/"* &> '/dev/null'
or_exit 'Temp folder is not empty'
# Loops over all the files in 'store'
for file_path in "$store/"*
do
    # Trim files bigger than 'max_size' from 'store' into 'temp'
    if [ "$( stat --format=%s "$file_path" )" -gt "$max_size" ]
    then
        # Truncates the file into the temp folder
        tail --bytes="$max_size" "$file_path" > "$temp/$(basename "$file_path")"
        or_exit "Cannot tail: $file_path"
    fi
done
unset -v file_path
# If there are files in 'temp', move all of them back to 'store'
if ls "$temp/"* &> '/dev/null'
then
# Moves all the truncated files back to the store
mv "$temp/"* "$store/"
or_exit 'Cannot move files from temp to store'
fi
So I am new to Perl and am trying to simply open a directory and list all its files. When I run the very simple code below, trying to print everything in /usr/bin, it will not work, and no matter what I try I keep getting told 'Could not open /usr/bin: No such file or directory'.
Any help would be much appreciated!
#!/usr/bin/perl
$indir = "/usr/bin";
# read in all files from the directory
opendir (DIR, @indir) or die "Could not open $indir: $!\n";
while ($filename = readdir(DIR)) {
    print "$filename\n";
}
closedir(DIR);
Here is another place where the very basic troubleshooting step of use strict; and use warnings; has been omitted, and it would have told you exactly what was wrong.
Global symbol "@indir" requires explicit package name (did you forget to declare "my @indir"?)
Of course, you'd also have to fix a few other errors (e.g. my $indir = '/usr/bin';)
I would also suggest that readdir is not well suited for this job, and would tend to recommend glob:
#!/usr/bin/env perl
use strict;
use warnings;
my $indir = "/usr/bin";
# read in all files from the directory
foreach my $filename ( glob "$indir/*" ) {
    print "$filename\n";
}
Note how this differs - it prints a full path to the file, and it omits certain things (like . and ..), which is, in my opinion, more generally useful. Not least because another really common error is to write open my $fh, '<', $filename or die $!, forgetting that the file is not in the current working directory.
I am trying to figure out a way to determine the most recent file created in a directory. I cannot use a module and I am on a Linux OS.
Just a simple Google search gave me a good answer:
my @list = `ls -t`;
chomp @list;
my $newest = $list[0];
Or completely in Perl:
opendir(my $DH, $DIR) or die "Error opening $DIR: $!";
my %files = map { $_ => (stat("$DIR/$_"))[9] } grep(! /^\.\.?$/, readdir($DH));
closedir($DH);
my @sorted_files = sort { $files{$b} <=> $files{$a} } (keys %files);
$sorted_files[0] is the most-recently modified. If it isn't the actual
file-of-interest, you can iterate through @sorted_files until you find
the interesting file(s).
No, you cannot get the files on the basis of their birth date, as there is no Linux command to get the birth date of a file; but of course you can get the access, modification, and change information about the file. To get the access, modification, and change time information of any file, use this:
stat file-name
Also, to get the most recently changed/modified file, use this:
ls -ltr | tail -1
Try:
cd DIR
ls -l -rt | tail -1
The naughty IO::All method: ;-)
use IO::All;
use v5.20;
# sort files by their modification time and store in an array:
my @files = sort{$b->mtime <=> $a->mtime} io(".")->all_files;
# get the first/newest file from the sort:
say "$files[0] ". ~~localtime($files[0]->mtime);
I used File::Find to traverse a directory tree and Win32::File's GetAttributes function to look at the attributes of files found in it. This worked in a single-threaded program.
Then I moved the directory traversal into a separate thread, and it stopped working. GetAttributes failed on every file with "The system cannot find the file specified" as the error message in $^E.
I traced the problem to the fact that File::Find uses chdir, and apparently GetAttributes doesn't use the current directory. I could work around this by passing it an absolute path, but then I could run into path length limits, and long paths are definitely going to be present where this script will run, so I really need to take advantage of chdir and relative paths.
To demonstrate the problem, here is a script which creates a file in the current directory, another file in a subdirectory, chdir's to the subdirectory, and looks for the file 3 ways: system("dir"), open, and GetAttributes.
When the script is run without arguments, dir shows the subdirectory, open finds the file in the subdirectory, and GetAttributes returns its attributes successfully. When run with --thread, all the tests are done in a subthread, and the dir and open still work, but the GetAttributes fails. Then it calls GetAttributes on the file that is in the original directory (which we have chdir'ed out of) and it finds that one! Somehow GetAttributes is using the original working directory of the process - or maybe the working directory of the main thread - unlike all the other file operations.
How can I fix this? I can guarantee that the main thread won't do any chdir'ing, if that matters.
use strict;
use warnings;
use threads;
use Data::Dumper;
use Win32::File qw/GetAttributes/;
sub doit
{
    chdir("testdir") or die "chdir: $!\n";
    system "dir";
    my $attribs;
    open F, '<', "file.txt" or die "open: $!\n";
    print "open succeeded. File contents:\n-------\n", <F>, "\n--------\n";
    close F;
    my $x = GetAttributes("file.txt", $attribs);
    print Dumper [$x, $attribs, $!, $^E];
    if(!$x) {
        # If we didn't find the file we were supposed to find, how about the
        # bad one?
        $x = GetAttributes("badfile.txt", $attribs);
        if($x) {
            print "GetAttributes found the bad file!\n";
            if(open F, '<', "badfile.txt") {
                print "opened the bad file\n";
                close F;
            } else {
                print "But open didn't open it. Error: $! ($^E)\n";
            }
        }
    }
}
# Setup
-d "testdir" or mkdir "testdir" or die "mkdir testdir: $!\n";
if(!-f "badfile.txt") {
open F, '>', "badfile.txt" or die "create badfile.txt: $!\n";
print F "bad\n";
close F;
}
if(!-f "testdir/file.txt") {
open F, '>', "testdir/file.txt" or die "create testdir/file.txt: $!\n";
print F "hello\n";
close F;
}
# Option 1: do it in the main thread - works fine
if(!(#ARGV && $ARGV[0] eq '--thread')) {
doit();
}
# Option 2: do it in a secondary thread - GetAttributes fails
if(#ARGV && $ARGV[0] eq '--thread') {
my $thr = threads->create(\&doit);
$thr->join();
}
Eventually, I figured out that Perl maintains some kind of secondary cwd that only applies to Perl built-in operators, while GetAttributes uses the native cwd. I don't know why it does this or why it only happens in the secondary thread; my best guess is that Perl is trying to emulate the Unix rule of one cwd per process, and failing because the Win32::* modules don't play along.
Whatever the reason, it's possible to work around it by forcing the native cwd to be the same as perl's cwd whenever you're about to do a Win32::* operation, like this:
use Cwd;
use Win32::FindFile qw/SetCurrentDirectory/;
...
SetCurrentDirectory(getcwd());
Arguably File::Find should do this when running on Win32.
Of course this only makes the "pathname too long" problem worse, because now every directory you visit will be the target of an absolute-path SetCurrentDirectory. Try to work around that with a series of smaller SetCurrentDirectory calls and you have to figure out a way to get back to where you came from, which is hard when you don't even have fchdir.
I have the following situation:
There is a Windows folder that has been mounted on a Linux machine. There could be multiple folders (set up beforehand) in this Windows mount. I have to do something (preferably a script to start with) to watch these folders.
These are the steps:
Watch for any incoming file(s). Make sure they are transferred completely.
Move it to another folder.
I do not have any control over the file transfer program on the Windows machine. It is a secure FTP, I believe. So I cannot ask that process to send me a trailer file to ensure the completion of the file transfer.
I have written a bash script. I would like to know about any potential pitfalls with this approach. The reason is, there is a possibility of multiple copies of this script running for multiple directories like this. At the moment, there could be up to 100 directories that may have to be monitored.
Following is the script. I'm sorry for pasting a very long one here. Please take your time to review it and comment / criticize it. :-)
It takes 3 parameters: the folder that has to be watched, the folder where the files have to be moved, and a time interval, which is explained below.
Linux servername 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:27:17 EDT 2006 i686 i686 i386
GNU/Linux
#!/bin/bash
log_this()
{
    message="$1"
    now=`date "+%D-%T"`
    echo $$": "$now ": " $message
}
usage()
{
    cat << EOF
Usage: $0 <Directory to be watched> <Directory to transfer> <time interval>
Time interval is the amount of time after which the modification time of a
file will be monitored.
EOF
    exit 1
}
if [ $# -lt 2 ]
then
    usage
fi
WATCH_DIR=$1
APP_DIR=$2
if [ ! -d "$WATCH_DIR" ]
then
log_this "FATAL: WATCH_DIR, $WATCH_DIR does not exist. Exiting"
exit 1
fi
if [ ! -d "$APP_DIR" ]
then
log_this "APP_DIR: $APP_DIR does not exist. Exiting"
exit 1
fi
# This needs to be set after considering the rate of file transfer.
# Represents the seconds elapsed after the last modification to the file.
# If not supplied as parameter, defaults to 3.
seconds_between_mods=$3
if ! [[ "$seconds_between_mods" =~ ^[0-9]+$ ]]; then
if [ ${#seconds_between_mods} -eq 0 ]; then
log_this "No value supplied for elapse time. Defaulting to 3."
seconds_between_mods=3
else
log_this "Invalid value provided for elapse time"
exit 1
fi
fi
log_this "Start Monitor."
while true
do
    ls -1 $WATCH_DIR | while read file_name
    do
        log_this "Start Monitoring for $file_name"
        # Refer only the modification with reference to the mount folder.
        # If there is a diff in time between servers, we are in trouble.
        token_file=$WATCH_DIR/foo.$$
        current_time=`touch $token_file && stat -c "%Y" $token_file`
        rm -f $token_file 2>/dev/null
        log_this "Current Time: $current_time"
        last_mod_time=`stat -c "%Y" $WATCH_DIR/$file_name`
        elapsed_time=`expr $current_time - $last_mod_time`
        log_this "Elapsed time ==> $elapsed_time"
        if [ $elapsed_time -ge $seconds_between_mods ]
        then
            log_this "Moving $file_name to $APP_DIR"
            # In case there is no space left on the target mount, hide the file
            # in the mount itself and remove the incomplete file from APP_DIR.
            mv $WATCH_DIR/$file_name $APP_DIR
            if [ $? -ne 0 ]
            then
                log_this "FATAL: mv failed!! Hiding $file_name"
                rm $APP_DIR/$file_name
                mv $WATCH_DIR/$file_name $WATCH_DIR/.$file_name
                log_this "Removed $APP_DIR/$file_name. Look for $WATCH_DIR/.$file_name and submit later."
            fi
            log_this "End Monitoring for $file_name"
        else
            log_this "$file_name: Transfer seems to be in progress"
        fi
    done
    log_this "Nothing more to monitor."
    echo
    sleep 5
done
This isn't going to work for any length of time. In production, you will have network problems and other errors which can leave a partial file in the upload directory. I also don't like the idea of a "trailer" file. The usual approach is to upload the file under a temporary name and then rename it after the upload completes.
This way, you just have to list the directory, filter the temporary names out, and if there is anything left, use it.
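A minimal sketch of that filtering, assuming the uploader writes to names ending in .part (that suffix is an assumption; use whatever convention your transfer tool follows) and reusing the WATCH_DIR / APP_DIR names from the question:

for f in "$WATCH_DIR"/*; do
    case "$f" in
        *.part) continue ;;             # assumed temp suffix: upload still in progress
    esac
    [ -f "$f" ] && mv "$f" "$APP_DIR/"  # anything else is complete; hand it over
done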
If you can't make this change, then ask your boss for a written permission to implement something which can lead to arbitrary data corruption. This is for two purposes: 1) To make them understand that this is a real problem and not something which you make up and 2) to protect yourself when it breaks ... because it will and guess who'll get all the blame?
I believe a much saner approach would be a kernel-level filesystem notification mechanism, such as inotify, along with its user-space tools (inotify-tools).
incron is an "inotify cron" system. It consists of a daemon and a table manipulator. You can use it in a similar way to the regular cron. The difference is that incron handles filesystem events rather than time periods.
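For illustration, an incrontab entry might look like this (a sketch: /data/incoming and handle-upload are placeholder names; $@ expands to the watched path and $# to the file name):

/data/incoming IN_CLOSE_WRITE /usr/local/bin/handle-upload $@/$#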
First make sure inotify-tools is installed.
Then use them like this:
logOfChanges="/tmp/changes.log.csv" # Set your file name here.
# Lock and load
inotifywait -mrcq $DIR > "$logOfChanges" & # monitor, recursively, output CSV, be quiet.
IN_PID=$!  # PID of the backgrounded inotifywait ($$ would be this script itself)
# Do your stuff here
...
# Kill and analyze
kill $IN_PID
cat "$logOfChanges" | while read entry; do
    # Split your CSV, but beware that file names may contain spaces too.
    # Just look up how to parse CSV with bash. :)
    path=...
    event=...
    ... # Other stuff like time stamps
    # Depending on the event…
    case "$event" in
        SOME_EVENT) myHandlingCode "$path" ;;
        ...
        *) myDefaultHandlingCode "$path" ;;
    esac
done
Alternatively, using --format instead of -c on inotifywait would be an idea.
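For instance, a sketch of that variant (the pipe-separated format string, $DIR, and the echo handler are all placeholders to adapt):

inotifywait -mrq --format '%w|%f|%e' "$DIR" | while IFS='|' read -r dir file event; do
    echo "$event on $dir$file"   # replace with your real handling
done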
Just man inotifywait and man inotifywatch for more info.
To be honest, a Python app set up to run at start-up will do this quickly and efficiently. Python has amazing OS support and is rather complete.
Running the script will likely work, but it will be troublesome to look after and manage. I take it you will run these as frequent cron jobs?
To get you off your feet, here is a small app I wrote which takes a path and looks at the binary output of JPEG files. I never quite finished it, but it will get you started and show you the structure of Python as well as some use of os. I wouldn't spend too much time worrying about my code.
import time, os, sys
#analyze() takes in a path and reads each file in its output_files folder
def analyze(path):
    list_outputfiles = os.listdir(path + "/output_files")
    print list_outputfiles
    for name in list_outputfiles:
        f = open(os.path.join(path, "output_files", name), 'r')
        f.readlines()
        f.close()
#txtmaker reads the media file and writes its binary contents to a text file.
def txtmaker(c_file):
    print c_file
    os.system('cat "%s" > "%s.txt"' % (c_file, c_file))  # quote names so spaces survive the shell
    os.system("mv *.txt output_files")
#parser() takes in the input path, reads and lists all files, creates a directory, then calls txtmaker.
def parser(path):
    os.chdir(path)
    os.mkdir(path + "/output_files", 0777)
    list_files = os.listdir(path)
    for i in range(len(list_files)):
        if os.path.isdir(list_files[i]) == True:
            print (list_files[i], "is a directory")
        else:
            txtmaker(list_files[i])
    analyze(path)
def main():
    path = raw_input("Enter the full path to the media: ")
    parser(path)

if __name__ == '__main__':
    main()