Linux queue of files

I have a few files waiting to be processed by a daily cron job:
file1
file2
file3
I want the job to take the first file and then rename the rest. file1 should be deleted. file2 should be renamed to file1, and file3 should be renamed to file2.
I'm looking for a solution that would work with any number of files.
Is there a simple way to do this with a script? Or, taking a step back, is there a standard Linux technique for handling a queue of files?

It looks like you are trying to implement a simple queueing mechanism for processing work on an arbitrary number of files, treating the filenames as queue positions (so that file1 is the head). I think you're taking the queue metaphor a bit too literally into filesystem space, though: renaming all of those files is extremely expensive in terms of filesystem operations and race-condition prone to boot (what if more files are added to the queue while you are renaming the previous ones?). What you should do instead is simply track the filenames to be operated on in a side file (i.e. don't traverse the filesystem looking for work; traverse your "queue file" instead) and lock that file whenever you remove or add an entry. A nice side effect of that approach is that your filenames can then be anything you like; they don't have to be "file1, file2, ...".
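A minimal sketch of that idea (assuming the queue lives in queue.txt with one filename per line, producers append to it, and "process" is just a placeholder for whatever the real job does):
#!/bin/bash
# Pop the head of the queue under a lock, then process it.
exec 9>queue.lock                  # dedicated lock file, held on fd 9
flock 9                            # block until we own the lock
next=$(head -n 1 queue.txt)        # take the head entry
tail -n +2 queue.txt > queue.tmp && mv queue.tmp queue.txt   # remove it from the queue
flock -u 9                         # release the lock before the slow processing step
[ -n "$next" ] && process "$next"  # "process" stands in for the real work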

You can use a simple bash script like the one below. It first lists the files in folder1 into data.txt, ordered by timestamp (oldest first). The first file is then removed, and every remaining file is renamed to the name of the file that preceded it.
#!/bin/bash
# List the files in folder1, oldest first, into data.txt
ls -tr folder1/ > data.txt
# Remove the first (oldest) file in the queue
rm -f "folder1/$(head -n 1 data.txt)"
# Rename each remaining file to the name of the file before it
while IFS= read -r file
do
    if [ -f "folder1/$file" ]; then
        mv "folder1/$file" "folder1/$newFile"
    fi
    newFile="$file"
done < data.txt

Related

How to rename multiple files in linux and store the old file names with the new file name in a text file?

I am a novice Linux user. I have 892 .pdb files and I want to rename all of them in sequential order as L1, L2, L3, L4, ..., L892. I then want a text file which maps the old names to the new names (i.e. L1, L2, L3). Please help me with this. Thank you for your time.
You could just do:
#!/bin/sh
i=0
for f in *.pdb; do
    i=$((i + 1))
    # Rename the file and record the mapping on success
    mv "$f" "L$i" && echo "$f --> L$i"
done > filelist
Note that you probably want to move the files into a different directory, as that will make it easier to recover if an error occurs midway through. Also be wary that this will overwrite any existing files and potentially cause a big mess. It's not idempotent (you can't run it twice). You would probably be better off not doing the move at all and instead doing something like:
#!/bin/sh
i=0
mkdir -p newfiles
for f in *.pdb; do
    i=$((i + 1))
    # Hard-link the original into newfiles/ and record the NUL-separated name pair
    ln "$f" "newfiles/L$i" && printf '%s\0%s\0' "$f" "L$i"
done > filelist
This latter solution creates links to the original files in a subdirectory, so you can run it multiple times without munging the original data. Also, it uses null separators in the file list so you can unambiguously distinguish names that have newlines or tabs or spaces in them. It makes for a list that is not particularly human readable, but you can easily filter it through tr to make it pretty.
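For instance, assuming none of the names contain newlines, one way to turn the list into a readable two-column (old name, new name) listing is:
tr '\0' '\n' < filelist | paste - -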

Maintaining variables in function - Global variables

I'm trying to run a script in a function and then call it:
filedetails ()
{
    # read TOTAL_DU < "/tmp/sizes.out";
    disksize=`du -s "$1" | awk '{print $1}'`;
    let TOTAL_DU+=$disksize;
    echo "$TOTAL_DU";
    # echo $TOTAL_DU > "/tmp/sizes.out"
}
I'm using the variable TOTAL_DU as a counter to keep a running total of the du of all the files.
I'm running it using parallel or xargs:
find . -type f | parallel -j 8 filedetails
But the variable TOTAL_DU is reset every time and the count is not maintained, which is expected since a new shell is used each time.
I've also tried using a file to export and then read the counter, but because of parallel some jobs complete faster than others, so it's not sequential (as expected) and this is no good.
The question is: is there a way to keep the count whilst using parallel or xargs?
Aside from learning purposes, this is not likely to be a good use of parallel, because:
Calling du like that will quite possibly be slower than just invoking du in the normal way. First, the information about file sizes can be extracted from the directory, and so an entire directory can be computed in a single access. Effectively, directories are stored as a special kind of file object, whose data is a vector of directory entries ("dirents"), which contain the name and a reference to the metadata for each file. What you are doing is using find to print these dirents, then getting du to parse each one (every file, not every directory); almost all of this second scan is redundant work.
Insisting that du examine every file prevents it from avoiding double-counting multiple hard links to the same file, so you can easily end up inflating the disk usage this way. On the other hand, directories also take up disk space, and normally du will include this space in its reports. But you're never calling it on any directory, so you will end up understating the total disk usage.
You're invoking a shell and an instance of du for every file. Normally, you would only create a single process for a single du. Process creation is a lot slower than reading a filesize from a directory. At a minimum, you should use parallel -X and rewrite your shell function to invoke du on all the arguments, rather than just $1.
There is no way to share environment variables between sibling shells. So you would have to accumulate the results in a persistent store, such as a temporary file or database table. That's also an expensive operation, but if you adopted the above suggestion, you would only need to do it once for each invocation of du, rather than for every file.
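For comparison, the "normal way" mentioned in the first point is a single ordinary du invocation, which lets one du process do the whole traversal and the hard-link bookkeeping itself:
du -sh .    # one process computes the grand total in a single pass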
So, ignoring the first two issues, and just looking at the last two, solely for didactic purposes, you could do something like the following:
# Create a temporary file to store results
tmpfile=$(mktemp)
# Function which invokes du and safely appends its summary line
# to the temporary file
collectsizes() {
    # Get the name of the temporary file, and remove it from the args
    tmpfile=$1
    shift
    # Call du on all the parameters, and get the last (grand total) line
    size=$(du -c -s "$@" | tail -n1)
    # Lock the temporary file and append the data line under lock
    flock "$tmpfile" bash -c 'printf "%s\n" "$1" >> "$2"' _ "$size" "$tmpfile"
}
export -f collectsizes
# Find all regular files, and feed them to parallel in batches (-X), taking care
# to avoid problems if files have whitespace in their names
find -type f -print0 | parallel -0 -X -j8 collectsizes "$tmpfile"
# When all that's done, sum up the values in the temporary file
awk '{s+=$1}END{print s}' "$tmpfile"
# And delete it.
rm "$tmpfile"

How would I flatten and overlay multiple directories into one directory?

I want to take a list of directory hierarchies and flatten them into a single directory. Any duplicate file later in the list will replace an earlier file. For example...
foo/This/That.pm
bar/This/That.pm
bar/Some/Module.pm
wiff/This/That.pm
wiff/A/Thing/Here.pm
This would wind up with
This/That.pm # from wiff/
Some/Module.pm # from bar/
A/Thing/Here.pm # from wiff/
I have a probably overcomplicated Perl program to do this. I'm interested in the clever ways SO users might solve it. The big hurdle is "create the intermediate directories if necessary", perhaps with some combination of basename and dirname.
The real problem I'm solving is checking the difference between two installed Perl libraries. I'm first flattening the multiple library directories for each Perl into a single directory, simulating how Perl would search for a module. I can then diff -r them.
If you do not mind the final order of the entries, I guess this can do the job:
#!/bin/bash
# Key: the path with its leading source directory stripped;
# value: the source directory it came from (later lines overwrite earlier ones)
declare -A directory
while IFS= read -r line; do
    directory["${line#*/}"]=${line%%/*}
done < "$1"
for entry in "${!directory[@]}"; do
    printf "%s\t# from %s/\n" "$entry" "${directory[$entry]}"
done
Output:
$ ./script.sh files.txt
A/Thing/Here.pm # from wiff/
This/That.pm # from wiff/
Some/Module.pm # from bar/
And if you need to move the files, then you can simply replace the printing step with mv or cp, like this:
for entry in "${!directory[@]}"; do
    mv "${directory[$entry]}/$entry" "your_dir_path/$entry"
done
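If the target tree does not exist yet, the intermediate directories (the "big hurdle" from the question) still have to be created first; a minimal variation, keeping the same hypothetical your_dir_path placeholder:
for entry in "${!directory[@]}"; do
    mkdir -p "your_dir_path/$(dirname "$entry")"   # create intermediate directories as needed
    mv "${directory[$entry]}/$entry" "your_dir_path/$entry"
done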

rsync and (post) process synced file

I'd like to rsync my photos from one (Linux) disk partition to another (the backup location) using a shell script.
The problem is that I need to re-scale all photos which are saved to the backup location, for example with mogrify.
Is it possible to post-process every file which is synced/copied by rsync,
in order to execute mogrify on every synced file?
Another way could be to use rsync only to generate the list of files which have to be synced, and then run a loop that mogrifies every list entry, writing the scaled photo to the backup location.
The problem there is that I would have to create all the directories and child directories to keep the original folder structure before saving the photo.
Using rsync would handle the folder creation "on the fly".
So: is it possible to execute a command on every file synced with rsync?
rsync has a -i/--itemize-changes flag that outputs a summary of what it does with each file.
I suggest you play a bit with it; I'm seeing it output lines like >f+++++++++ file1 for a new file, >f..T...... file1 for an unchanged file, >f.sT...... file1 for an update, etc.
Having that, you can read the output into a variable and parse it with grep and cut:
#!/bin/bash
# Capture the itemized log of what rsync transferred
log=$(rsync -i rsync-client/* rsync-server/)
# Keep only the lines for newly added files and extract the file names
newFiles=$(echo "$log" | grep '>f+++++++++' | cut -d' ' -f2)
for file in $newFiles
do
    echo "Added file $file"
done
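To actually post-process each file rather than just reporting it, the echo can be replaced with the resize command; a rough sketch, assuming ImageMagick's mogrify and an arbitrary 50% scale (filenames containing spaces would need extra care, as in the loop above):
for file in $newFiles
do
    mogrify -resize 50% "rsync-server/$file"   # rescale the backup copy in place
done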

Why doesn't "sort file1 > file1" work?

When I try to sort a file and save the sorted output back into the same file, like this
sort file1 > file1;
the contents of file1 get erased altogether, whereas when I try to do the same with the 'tee' command, like this
sort file1 | tee file1;
it works fine [ed: "works fine" only for small files with lucky timing; it will lose data on large ones or with unhelpful process scheduling], i.e. it overwrites file1 with its own sorted output and also shows it on standard output.
Can someone explain why the first case is not working?
As other people explained, the problem is that the I/O redirection is done before the sort command is executed, so the file is truncated before sort gets a chance to read it. If you think for a bit, the reason why is obvious - the shell handles the I/O redirection, and must do that before running the command.
The sort command has 'always' (since at least Version 7 UNIX) supported a -o option to make it safe to output to one of the input files:
sort -o file1 file1 file2 file3
The trick with tee depends on timing and luck (and probably a small data file). If you had a megabyte or larger file, I expect it would be clobbered, at least in part, by the tee command. That is, if the file is large enough, the tee command would open the file for output and truncate it before sort finished reading it.
It doesn't work because '>' redirection implies truncation, and rather than keeping the whole output of sort in memory before redirecting it to the file, bash truncates the file and sets up the redirection before running sort. Thus the contents of file1 are truncated before sort has a chance to read them.
It's unwise to depend on either of these commands to work the way you expect.
The way to modify a file in place is to write the modified version to a new file, then rename the new file to the original name:
sort file1 > file1.tmp && mv file1.tmp file1
This avoids the problem of reading the file after it's been partially modified, which is likely to mess up the results. It also makes it possible to deal gracefully with errors; if the file is N bytes long, and you only have N/2 bytes of space available on the file system, you can detect the failure creating the temporary file and not do the rename.
Or you can rename the original file, then read it and write to a new file with the same name:
mv file1 file1.bak && sort file1.bak > file1
Some commands have options to modify files in place (for example, perl and sed both have -i options; note that the syntax of sed's -i option can vary). But these options work by creating temporary files; it's just done internally.
Redirection is handled first. So in the first case, > file1 takes effect before sort runs and empties the file.
The first command doesn't work (sort file1 > file1) because, when the redirection operator (> or >>) is used, the shell creates/truncates the file before the sort command is even invoked, since the redirection is set up first.
The second command appears to work (sort file1 | tee file1) because, for a small file, sort gets to read the lines before tee truncates the file, and then writes the sorted data to standard output; as the editorial note above says, this is a race and not something to rely on.
So with any similar command, avoid using the redirection operator when reading from and writing to the same file; use a relevant in-place editor instead (e.g. ex, ed, sed), for example:
ex '+%!sort' -cwq file1
or use other utils such as sponge.
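For example, sponge (from moreutils) soaks up all of its input before opening the output file, so the truncation race never happens:
sort file1 | sponge file1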
Luckily for sort there is the -o parameter, which writes the results to the file (as suggested by @Jonathan), so the solution is straightforward: sort -o file1 file1.
In the first case, Bash opens (and empties) the file when it sets up the redirection, and only then calls sort.
In the second case, tee opens the file after sort has already read the contents, which, as noted above, is a matter of timing rather than a guarantee.
You can use this method:
sort file1 -o file1
This will sort the file and store the result back in the original file. You can also use this command to remove duplicate lines:
sort -u file1 -o file1
