Script to distribute a large number of files into smaller groups - Linux

I have folders containing large numbers of files (e.g. 1000+) of various sizes which I want to move into smaller groups of, say, 100 files per folder.
I wrote an AppleScript which counted the files, created a numbered subfolder, and then moved 100 files into the new folder (the number of files could be specified), looping until there were fewer than the specified number of files, which it moved into the last folder it created.
The problem was that it ran horrendously slowly. I'm looking for either an AppleScript or shell script I can run on my MacBook and/or Linux box which will efficiently move the files into smaller groups.
How the files are grouped is not particularly significant; I just want fewer files in each folder.

This should get you started:
DIR=$1
BATCH_SIZE=$2
SUBFOLDER_NAME=$3
COUNTER=1

while [ "$(find "$DIR" -maxdepth 1 -type f | wc -l)" -gt "$BATCH_SIZE" ] ; do
    NEW_DIR=$DIR/${SUBFOLDER_NAME}${COUNTER}
    mkdir -p "$NEW_DIR"
    find "$DIR" -maxdepth 1 -type f | head -n "$BATCH_SIZE" | xargs -I {} mv {} "$NEW_DIR"
    let COUNTER++
    if [ "$(find "$DIR" -maxdepth 1 -type f | wc -l)" -le "$BATCH_SIZE" ] ; then
        # Put the remaining files into one last numbered subfolder
        NEW_DIR=$DIR/${SUBFOLDER_NAME}${COUNTER}
        mkdir -p "$NEW_DIR"
        find "$DIR" -maxdepth 1 -type f | head -n "$BATCH_SIZE" | xargs -I {} mv {} "$NEW_DIR"
    fi
done
The nested if statement gets the last remaining files. You can add any additional checks you see fit after modifying this for your use.
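For instance, if the snippet above is saved as split_files.sh (a hypothetical name) and made executable, it could be invoked like this to create batch_1, batch_2, ... subfolders with at most 100 files each:
./split_files.sh /path/to/folder 100 batch_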

This is a tremendous kludge, but it shouldn't be too terribly slow:
rm -f /tmp/counter*
touch /tmp/counter1
find /source/dir -type f -print0 |
    xargs -0 -n 100 \
        sh -c 'n=$(echo /tmp/counter*); \
            n=${n#/tmp/counter}; \
            counter="/tmp/counter$n"; \
            mv "$counter" "/tmp/counter$((n+1))"; \
            mkdir "/dest/dir/$n"; \
            mv "$@" "/dest/dir/$n"' _
It's completely indiscriminate as to which files go where.

The most common way to solve the problem of directories with too many files in them is to subdivide by the first couple of characters of the name. For example:
Before:
aardvark
apple
architect
...
zebra
zork
After:
a/aardvark
a/apple
a/architect
b/...
...
z/zebra
z/zork
If that isn't subdividing well enough, then go one step further:
a/aa/aardvark
a/ap/apple
a/ar/architect
...
z/ze/zebra
z/zo/zork
This should work quite quickly, because the move command that your script executes can use simple glob expansion to select all the files to move, for example mv aa* a/aa, as opposed to having to run a separate move command for each file (which would be my first guess as to why the original script was slow).
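A rough sketch of that idea in shell, assuming lowercase ASCII file names and a separate destination tree (both paths are placeholders):
src=/path/to/flat/folder
dst=/path/to/split/folder
for letter in {a..z}; do
    mkdir -p "$dst/$letter"
    # One mv per letter; the shell's glob expansion selects every matching file at once.
    mv "$src/$letter"* "$dst/$letter"/ 2>/dev/null || true
done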

Related

LINUX Copy the name of the newest folder and paste it in a command [duplicate]

I would like to find the newest subdirectory in a directory and save the result to a variable in bash.
Something like this:
ls -t /backups | head -1 > $BACKUPDIR
Can anyone help?
BACKUPDIR=$(ls -td /backups/*/ | head -1)
$(...) evaluates the statement in a subshell and returns the output.
There is a simple solution to this using only ls:
BACKUPDIR=$(ls -td /backups/*/ | head -1)
-t orders by time (latest first)
-d lists the directories themselves rather than their contents
*/ only lists directories
head -1 returns the first item
I didn't know about */ until I found Listing only directories using ls in bash: An examination.
This is a pure Bash solution:
topdir=/backups
BACKUPDIR=
# Handle subdirectories beginning with '.', and empty $topdir
shopt -s dotglob nullglob
for file in "$topdir"/* ; do
    [[ -L $file || ! -d $file ]] && continue
    [[ -z $BACKUPDIR || $file -nt $BACKUPDIR ]] && BACKUPDIR=$file
done
printf 'BACKUPDIR=%q\n' "$BACKUPDIR"
It skips symlinks, including symlinks to directories, which may or may not be the right thing to do. It skips other non-directories. It handles directories whose names contain any characters, including newlines and leading dots.
Well, I think this solution is the most efficient:
path="/my/dir/structure/*"
backupdir=$(find $path -type d -prune | tail -n 1)
An explanation of why this is a little better:
We do not need sub-shells (aside from the one for getting the result into the bash variable).
We do not need a useless -exec ls -d at the end of the find command, it already prints the directory listing.
We can easily alter this, e.g. to exclude certain patterns. For example, if you want the second newest directory, because backup files are first written to a tmp dir in the same path:
backupdir=$(find $path -type d -prune -not -name "*temp_dir" | tail -n 1)
The above solution doesn't take into account things like files being written and removed from the directory resulting in the upper directory being returned instead of the newest subdirectory.
The other issue is that this solution assumes that the directory only contains other directories and not files being written.
Let's say I create a file called "test.txt" and then run this command again:
echo "test" > test.txt
ls -t /backups | head -1
test.txt
The result is test.txt showing up instead of the last modified directory.
The proposed solution "works" but only in the best case scenario.
Assuming you have a maximum of 1 directory depth, a better solution is to use:
find /backups/* -type d -prune -exec ls -d {} \; | tail -1
Just swap the "/backups/" portion for your actual path.
If you want to avoid showing an absolute path in a bash script, you could always use something like this:
LOCALPATH=/backups
DIRECTORY=$(cd $LOCALPATH; find * -type d -prune -exec ls -d {} \; | tail -1)
With GNU find you can get list of directories with modification timestamps, sort that list and output the newest:
find . -mindepth 1 -maxdepth 1 -type d -printf "%T@\t%p\0" | sort -z -n | cut -z -f2- | tail -z -n1
or newline separated
find . -mindepth 1 -maxdepth 1 -type d -printf "%T@\t%p\n" | sort -n | cut -f2- | tail -n1
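For instance (a small usage sketch using the question's /backups path), the newline-separated form can be captured directly into the variable the question asks for:
BACKUPDIR=$(find /backups -mindepth 1 -maxdepth 1 -type d -printf "%T@\t%p\n" | sort -n | cut -f2- | tail -n1)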
With POSIX find (that does not have -printf) you may, if you have it, run stat to get file modification timestamp:
find . -mindepth 1 -maxdepth 1 -type d -exec stat -c '%Y %n' {} \; | sort -n | cut -d' ' -f2- | tail -n1
Without stat, a pure shell solution may be used by replacing the [[ bash extension with [ as in this answer.
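A rough sketch of that adaptation (note that dotglob/nullglob are bash-only, so hidden subdirectories and an empty directory are not handled here, and -nt in [ is a widespread but non-POSIX extension of test):
topdir=/backups
BACKUPDIR=
for file in "$topdir"/*; do
    [ -L "$file" ] && continue      # skip symlinks
    [ -d "$file" ] || continue      # skip non-directories
    if [ -z "$BACKUPDIR" ] || [ "$file" -nt "$BACKUPDIR" ]; then
        BACKUPDIR=$file
    fi
done
printf 'BACKUPDIR=%s\n' "$BACKUPDIR"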
Your "something like this" was almost a hit:
BACKUPDIR=$(ls -t ./backups | head -1)
Combining what you wrote with what I have learned solved my problem too. Thank you for raising this question.
Note: I run the line above from Git Bash within a Windows environment, in a file called ./something.bash.

How to copy the contents of a folder to multiple folders based on number of files?

I want to copy the files from a folder (named: 1) to multiple folders based on the number of files (here: 50).
The code given below works. I transferred all the files from the folder to the subfolders based on the number of files, and then copied all the files from those subfolders back to the initial folder.
However, I need something cleaner and more efficient. Apologies for the mess below, I'm a newbie.
bf=1 #breakfolder
cd 1 #the folder from where I wanna copy stuff, contains 179 files
flies_exist=$(ls -1q * | wc -l) #assign the number of files in folder 1

#move 50 files from 1 to various subfolders
while [ $flies_exist -gt 50 ]
do
    mkdir ../CompiledPdfOutput/temp/1-$bf
    set --
    for f in .* *; do
        [ "$#" -lt 50 ] || break
        [ -f "$f" ] || continue
        [ -L "$f" ] && continue
        set -- "$@" "$f"
    done
    mv -- "$@" ../CompiledPdfOutput/temp/1-$bf/
    flies_exist=$(ls -1q * | wc -l)
    bf=$(($bf + 1))
done

#move the rest of the files into one final subdir
mkdir ../CompiledPdfOutput/temp/1-$bf
set --
for f in .* *; do
    [ "$#" -lt 50 ] || break
    [ -f "$f" ] || continue
    [ -L "$f" ] && continue
    set -- "$@" "$f"
done
mv -- "$@" ../CompiledPdfOutput/temp/1-$bf/
#get out of 1
cd ..
# copy back the contents from subdir to 1
find CompiledPdfOutput/temp/ -exec cp {} 1 \;
The required directory structure is:
            parent
     _________|__________
    |                    |
    1            CompiledPdfOutput
    |                    |
  (179)                temp
                         |
         --------------------------------
        |        |         |            |
       1-1      1-2       1-3          1-4
       (50)     (50)      (50)         (29)
The number inside "()" denotes the number of files.
BTW, the final step of my code gives this warning; I'd be glad if anyone could explain what's happening and suggest a solution.
cp: -r not specified; omitting directory 'CompiledPdfOutput/temp/'
cp: -r not specified; omitting directory 'CompiledPdfOutput/temp/1-4'
cp: -r not specified; omitting directory 'CompiledPdfOutput/temp/1-3'
cp: -r not specified; omitting directory 'CompiledPdfOutput/temp/1-1'
cp: -r not specified; omitting directory 'CompiledPdfOutput/temp/1-2'
I don't want to copy the directories as well, just the files, so giving -r would be bad.
Assuming that you need something more compact/efficient, you can leverage existing tools (find, xargs) to create a pipeline, eliminating the need to program each step using bash.
The following will move the files into the split folders. It will find the files, group them 50 to a folder, use awk to generate the output folder names, and move the files. The solution is not as elegant as the original one :-(
find 1 -type f |
    xargs -L50 echo |
    awk '{ print "CompiledOutput/temp/1-" NR, $0 }' |
    xargs -L1 echo mv -t
As a side note, the current script moves the files from the '1' folder to the numbered folders, and then copies the files back to the original folder. Why not just copy the files to the numbered folders? You can use 'cp -p' to preserve timestamps, if that's needed.
Supporting file names with new lines (and spaces)
A clarification to the question indicates that the solution should work with file names containing embedded newlines (and white space). This requires a minor change to use the NUL character as the separator.
# Count the files and compute the number of output folders (50 files per folder)
FILE_COUNT=$(find 1 -type f -print0 | xargs -0 -I{} echo X | wc -l)
DIR_COUNT=$(( (FILE_COUNT + 49) / 50 ))

# Remove the previous tree, and create the output folders
OUT=CompiledOutput/temp
rm -rf $OUT
eval mkdir -p $OUT/1-{1..$DIR_COUNT}

# Process the files, using NUL as separator
find 1 -type f -print0 |
    awk -v RS="\0" -v OUT="$OUT" 'NR%50 == 1 { printf "%s/1-%d%s", OUT, 1+int(NR/50), RS } { printf "%s", ($0 RS) }' |
    xargs -0 -L51 -t mv -t
I did limited testing with both spaces and newlines in the file names; it looks OK on my machine.
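To sanity-check the result, a small sketch like this prints how many files ended up in each output folder:
for d in CompiledOutput/temp/1-*/; do
    printf '%s: %s files\n' "$d" "$(find "$d" -type f | wc -l)"
done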
I find a couple of issues with the posted script:
The logic of copying maximum 50 files per folder is overcomplicated, and the code duplication of an entire loop is error-prone.
It reuses the "$@" array of positional parameters for internal storage purposes. This variable was not intended for that; it would be better to use a new dedicated array.
Instead of moving files to sub-directories and then copying them back, it would be simpler to just copy them in the first step, without ever moving.
Parsing the output of ls is not recommended.
Consider this alternative, simpler logic:
Initialize an empty array to_copy, to keep files that should be copied
Initialize a folder counter, used to compute the target folder
Loop over the source files:
    Apply filters as before (skip if not a file)
    Add the file to to_copy
    If to_copy contains the target number of files, then:
        Create the target folder
        Copy the files contained in to_copy
        Reset the content of to_copy to empty
        Increment folder_counter
If to_copy is not empty:
    Create the target folder
    Copy the files contained in to_copy
Something like this:
#!/usr/bin/env bash

set -euo pipefail

distribute_to_folders() {
    local src=$1
    local target=$2
    local max_files=$3
    local to_copy=()
    local folder_counter=1
    for file in "$src"/* "$src"/.*; do
        [ -f "$file" ] || continue
        to_copy+=("$file")
        if (( ${#to_copy[@]} == max_files )); then
            mkdir -p "$target/$folder_counter"
            cp -v "${to_copy[@]}" "$target/$folder_counter/"
            to_copy=()
            ((++folder_counter))
        fi
    done
    if (( ${#to_copy[@]} > 0 )); then
        mkdir -p "$target/$folder_counter"
        cp -v "${to_copy[@]}" "$target/$folder_counter/"
    fi
}

distribute_to_folders "$@"
To distribute files in path/to/1 into directories of maximum 50 files under path/to/compiled-output, you can call this script with:
./distribute.sh path/to/1 path/to/compiled-output 50
BTW, the final step of my code gives this warning; I'd be glad if anyone could explain what's happening and suggest a solution.
Sure. The command find CompiledPdfOutput/temp/ -exec cp {} 1 \; finds files and directories, and tries to copy them. When cp encounters a directory and the -r parameter is not specified, it issues the warning you saw. You could add a filter for files, with -type f. If there are not excessively many files then a simple shell glob will do the job:
cp -v CompiledPdfOutput/temp/*/* 1
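Alternatively, the find-based variant with the -type f filter mentioned above would be:
find CompiledPdfOutput/temp/ -type f -exec cp {} 1 \;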
This will copy files to multiple folders of fixed size. Change source, target, and folderSize as per your requirement. This also works with file names containing special characters (e.g. 'file 131!@#$%^&*()_+-=;?').
source=1
target=CompiledPDFOutput/temp
folderSize=50
find $source -type f -printf "\"%p\"\0" \
| xargs -0 -L$folderSize \
| awk '{system("mkdir -p '$target'/1-" NR); printf "'$target'/1-" NR " %s\n", $0}' \
| xargs -L1 cp -t

Delete files 100 at a time and count total files

I have written a bash script to delete 100 files at a time from a directory because I was getting an "argument list too long" error, but now I want to count the total number of files that were deleted from the directory.
Here is the script
echo /example-dir/* | xargs -n 100 rm -rf
What I want is to write the total number of deleted files for each directory into a file, along with the path, for example: Deleted <count> files from <path>
How can i achieve this with my current setup?
You can do this by enabling verbose output from rm and then counting the output lines using wc -l.
If you have whitespaces or special characters in the file names, using echo to pass the list of files to xargs will not work.
Better use find with -print0 to use a NULL character as a delimiter for the individual files:
find /example-dir -type f -print0 | xargs --null -n 100 rm -vrf | wc -l
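Building on that pipeline, the summary line the question asks for can be written to a file roughly like this (a sketch; the log file name is a placeholder):
dir=/example-dir
count=$(find "$dir" -type f -print0 | xargs --null -n 100 rm -vrf | wc -l)
echo "Deleted $count files from $dir" >> deleted_files.log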
You can avoid xargs and do this in a simple while loop and use a counter:
destdir='/example-dir/'
count=0
while IFS= read -r -d '' file; do
    rm -rf "$file"
    ((count++))
done < <(find "$destdir" -type f -print0)
echo "Deleted $count files from $destdir"
Note use of -print0 to take care of file names with whitespaces/newlines/glob etc.
By the way, if you really have lots of files and you do this often, it might be useful to look at some other options:
Use find's built-in -delete
time find . -name \*.txt -print -delete | wc -l
30000
real 0m1.244s
user 0m0.055s
sys 0m1.037s
Use find's ability to build up maximal length argument list
time find . -name \*.txt -exec rm -v {} + | wc -l
30000
real 0m0.979s
user 0m0.043s
sys 0m0.920s
Use GNU Parallel's ability to build long argument lists
time find . -name \*.txt -print0 | parallel -0 -X rm -v | wc -l
30000
real 0m1.076s
user 0m1.090s
sys 0m1.223s
Use a single Perl process to read filenames and delete whilst counting
time find . -name \*.txt -print0 | perl -0ne 'unlink;$i++;END{print $i}'
30000
real 0m1.049s
user 0m0.057s
sys 0m1.006s
For testing, you can create 30,000 files really fast with GNU Parallel, which allows -X to also build up long argument lists. For example, I can create 30,000 files in 8 seconds on my Mac with:
seq -w 0 29999 | parallel -X touch file{}.txt

Find a bunch of randomly sorted images on disk and copy to target dir

For testing purposes I need a bunch of random images from disk, copied to a specific directory. So, in pseudo code:
find [] -iname "*.jpg"
and then sort -R
and then head -n [number wanted]
and then copy to destination
Is it possible to combine the above commands into a single bash command? Like e.g.:
for i in `find ./images/ -iname "*.jpg" | sort -R | head -n243`; do cp "$i" ./target/; done;
But that doesn't quite work. I feel I'll need an 'xargs' somewhere in there, but I'm afraid I don't understand xargs very well... would I need to pass a 'print0' (or equivalent) to all separate commands?
[edit]
I left out the final step: I'd like to copy the images to a certain directory under a new (sequential) name. So the first image becomes 1.jpg, the second 2.jpg etc. For this, the command I posted does not work as intended.
The command that you specified will also work without any issues; it works well for me. Can you point out the exact error you are facing?
Meanwhile,
This will just do the trick for you:
find ./images/ -iname "*.jpg" | sort -R | head -n <no. of files> | xargs -I {} cp {} target/
Simply use shuf -n.
Example:
find ./images/ -iname "*.jpg" | shuf -n 10 | xargs cp -t ./target/
It would copy 10 random images to ./target/. If you need 243 just use shuf -n 243.
According to your edit, this should do:
for i in `find ./images/ -iname "*.jpg" | sort -R | head -n2`; do cp "$i" ./target/$((1 + compt++)).jpg; done
Here, you add a counter to keep track of the number of files you already copied.
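A variation on the same idea that also copes with spaces in file names (a sketch relying on GNU shuf's -z option; adjust the count and paths as needed):
n=0
find ./images/ -iname "*.jpg" -print0 | shuf -z -n 243 |
while IFS= read -r -d '' img; do
    n=$((n + 1))
    cp -- "$img" "./target/$n.jpg"
done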

Delete all files in a directory, except those listed matching specific criteria

I need to automate a clean-up of a Linux-based FTP server that only holds backup files.
In our "/var/DATA" directory is a collection of directories. Any directory here used for backup begins with "DEV". In each "DEVxxx*" directory are the actual backup files, plus any user files that may have been needed in the course of maintenance on these devices.
We only want to retain the following files - anything else found in these "DEVxxx*" directories is to be deleted:
The newest two backups: ls -t1 | grep -m2 '^[[:digit:]]\{6\}_Config'
The newest backup done on the first of the month: ls -t1 | grep -m1 '^[[:digit:]]\{4\}01_Config'
Any file that was modified less than 30 days ago: find -mtime -30
Our good configuration file: ls verification_cfg
Anything that doesn't match the above should be deleted.
How can we script this?
I'm guessing a BASH script can do this, and that we can create a cron job to run daily to perform the task.
Something like this perhaps?
{ ls -t1 | grep -m2 '^[[:digit:]]\{6\}_Config' ;
  ls -t1 | grep -m1 '^[[:digit:]]\{4\}01_Config' ;
  find -mtime -30 ;
  ls -1 verification_cfg ;
} | rsync -a --include-from=- --exclude='*' /var/DATA/ /var/DATA.bak/
rm -rf /var/DATA
mv /var/DATA.bak /var/DATA
For what it's worth, here is the bash script I created to accomplish my task. Comments are welcome.
#!/bin/bash

# This script follows these rules:
#
# - Only process directories beginning with "DEV"
# - Do not process directories within the device directory
# - Keep files that match the following criteria:
#   - Keep the two newest automated backups
#   - Keep the six newest automated backups generated on the first of the month
#   - Keep any file that is less than 30 days old
#   - Keep the file "verification_cfg"
#
# - An automated backup file is identified as eight digits, followed by "_Config"
#   e.g. 20120329_Config

# Remember the current directory
CurDir=`pwd`

# FTP home directory
DatDir='/var/DATA/'
cd $DatDir

# Only process directories beginning with "DEV"
for i in `find . -maxdepth 1 -type d | egrep '\.\/DEV' | sort` ; do
    cd $DatDir
    echo Doing "$i"
    cd $i

    # Set the GROUP EXECUTE bit on all files
    find . -type f -exec chmod g+x {} \;

    # Find the two newest automated config backups
    for j in `ls -t1 | egrep -m2 '^[0-9]{8}_Config$'` ; do
        chmod g-x $j
    done

    # Find the six newest automated config backups generated on the first of the month
    for j in `ls -t1 | egrep -m6 '^[0-9]{6}01_Config$'` ; do
        chmod g-x $j
    done

    # Find all files that are less than 30 days old
    for j in `find -mtime -30 -type f` ; do
        chmod g-x $j
    done

    # Find the "verification_cfg" file
    for j in `find -name verification_cfg` ; do
        chmod g-x $j
    done

    # Remove any files that still have the GROUP EXECUTE bit set
    find . -type f -perm -g=x -exec rm -f {} \;
done

# Back to the user's current directory
cd $CurDir
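To run the clean-up daily, as suggested in the question, a crontab entry along these lines would do (the script path, log path, and time are placeholders):
# Run the FTP clean-up script every day at 02:30
30 2 * * * /usr/local/bin/ftp_cleanup.sh >> /var/log/ftp_cleanup.log 2>&1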
