Linux/Bash - Create Loop for "find" to output to a file

Apologies if this has been answered, I'm somewhat new to Linux but I didn't see anything here that was on target.
Anyway, I'm running this command:
find 2013-12-28 -name '*.gz' | xargs zcat | gzip > /fast/me/2013-12-28.csv.gz
The issue is that I need to run this command for about 250 distinct dates, so doing this one at a time is quite tedious.
What I want to do is have a script that will increment the date by 1 day after the "find" and in the file name. I really don't even know what this would look like, what commands to use, etc.
Background:
The find command is being used in a folder that's full of folders, each for 1 day of data. Each day's folder contains 24 subfolders, with each subfolder containing about 100 gzipped CSV files. So the find command is necessary 2 levels up from the folder because it will scan through each folder to combine all the data. The end result is that all the zipped up files are combined into 1 large zipped up file.
If anyone can help it would be hugely appreciated, otherwise I have about 250 more commands to execute, which obviously will suck.

What about something like this?
prev_date="2013-12-28"
for i in {0..250}; do
    next_date=$(date -d "$prev_date +1 day" +%Y-%m-%d)
    prev_date=$next_date
    # -print0 / -0 keep file names with spaces or quotes intact
    find "$next_date" -name '*.gz' -print0 | xargs -0 zcat | gzip > "/fast/me/$next_date.csv.gz"
done
It starts with the day after 2013-12-28 and iterates through 251 dates, ending like this:
2014-08-27
2014-08-28
2014-08-29
2014-08-30
2014-08-31
2014-09-01
2014-09-02
2014-09-03
2014-09-04
2014-09-05
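A minimal sketch of the same idea with the start date and the number of days as parameters, assuming GNU date. The name date_range is made up for illustration. It only prints the dates, so the list can be checked before anything gets written to /fast/me.

```shell
# Hypothetical helper (not part of the answer above): print COUNT consecutive
# dates starting at START, one per line. Assumes GNU date; -u sidesteps DST.
date_range() {
    start=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        date -u -d "$start + $i days" +%Y-%m-%d
        i=$((i + 1))
    done
}

# Usage (dry run): print the command for each day instead of running it.
# date_range 2013-12-28 250 | while IFS= read -r day; do
#     echo "find $day -name '*.gz' -print0 | xargs -0 zcat | gzip > /fast/me/$day.csv.gz"
# done
```

Once the printed commands look right, drop the echo and the surrounding quotes to actually run them.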

jmunsch's solution works very well if the dates are sequential. Otherwise you could do this:
(edited to replace dash characters with colons)
find . -mindepth 1 -maxdepth 1 -type d | while IFS= read -r folderName
do
    date=$(basename "$folderName")
    dateWithColons=$(echo "$date" | sed "s#-#:#g") # this will replace - with :
    find "$folderName" -name '*.gz' -print0 | xargs -0 zcat | gzip > "/fast/me/$dateWithColons.csv.gz"
done
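Here is a rough sketch of the same per-folder loop wrapped in a function, assuming GNU find, xargs and gzip. The name combine_days and its two arguments (source folder, output folder) are invented for the example, and unlike the loop above it keeps the folder name unchanged in the output file name.

```shell
# Hypothetical helper: for every sub-directory of SRC, concatenate all the
# *.gz files below it into OUT/<directory name>.csv.gz.
combine_days() {
    src=$1
    out=$2
    for dir in "$src"/*/; do
        [ -d "$dir" ] || continue          # the glob matched nothing
        day=$(basename "$dir")
        # sort -z gives a stable file order; -r skips zcat when nothing is found
        find "$dir" -type f -name '*.gz' -print0 | sort -z |
            xargs -0 -r zcat | gzip > "$out/$day.csv.gz"
    done
}

# Usage (paths are examples only):
# combine_days . /fast/me
```

The glob only picks up whatever folders exist, so gaps in the dates do not matter.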

Related

list base files in a folder with numerous date-stamped versions of a file

I've got a folder with numerous versions of files (thousands of them), each with a unique date/time stamp as the file extension. For example:
./one.20190422
./one.20190421
./one.20190420
./folder/two.txt.20190420
./folder/two.txt.20190421
./folder/folder/three.mkv.20190301
./folder/folder/three.mkv.20190201
./folder/folder/three.mkv.20190101
./folder/four.doc.20190401
./folder/four.doc.20190329
./folder/four.doc.20190301
I need to get a unique list of the base files. For example, for the above example, this would be the expected output:
./one
./folder/two.txt
./folder/folder/three.mkv
./folder/four.doc
I've come up with the below code, but am wondering if there is a better, more efficient way.
# find all directories
find ./ -type d | while read folder ; do
# go into that directory
# then find all the files in that directory, excluding sub-directories
# remove the extension (date/time stamp)
# sort and remove duplicates
# then loop through each base file
cd "$folder" && find . -maxdepth 1 -type f -exec bash -c 'printf "%s\n" "${@%.*}"' _ {} + | sort -u | while read file ; do
# and find all the versions of that file
ls "$file".* | customFunctionToProcessFiles
done
done
If it matters, the end goal is find all the versions of a specific file, in groups of the base file, and process them for something. So my plan was to get the base files, then loop through the list and find all the version files. So, using the above example again, I'd process all the one.* files first, then the two.* files, etc...
Is there a better, faster, and/or more efficient way to accomplish this?
Some notes:
There are potentially thousands of files. I know I could just search for all files from the root folder, remove the date/time extension, sort and get unique, but since there may be thousands of files I thought it might be more efficient to loop through the directories.
The date/time stamp extension of the file is not in my control and it may not always be just numbers. The only thing I can guarantee is it is on the end after a period. And, whatever format the date/time is in, all the files will share it -- there won't be some files with one format and other files with another format.

You can use find ./ -type f -regex to look for files directly
find ./ -type f -regex '.*\.[0-9]+'
./some_dir/asd.mvk.20190422
./two.txt.20190420
Also, pipe the result to your function through xargs, without needing while loops:
re='(.*)(\.[0-9]{8,8})'
find ./ -type f -regextype posix-egrep -regex "$re" | \
sed -re "s/$re/\1/" | \
xargs -r -d '\n' customFunctionToProcessFiles
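Since the question says the stamp is not always numeric, here is a small sketch that strips whatever follows the last period instead of matching digits. The function name base_names is made up for the example, and it assumes every file really does carry a stamp extension and that no file name contains a newline.

```shell
# Hypothetical helper: list the unique base names (path minus the final
# ".something" suffix) of all files below DIR, relative to DIR.
base_names() {
    (
        cd "$1" &&
        find . -type f |
        sed 's/\.[^./]*$//' |   # drop the last extension, never crossing a "/"
        LC_ALL=C sort -u
    )
}

# Usage: process all versions of each base file as a group.
# base_names . | while IFS= read -r base; do
#     ls "$base".* | customFunctionToProcessFiles
# done
```

A single find over the whole tree followed by sort -u is normally cheaper than one find per directory, even with thousands of files.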

how to count the number of files an extension was just added to?

so I just added the extension .txt to all files in a directory, I want to go beyond that and now count the number of files whose extension I just changed. Any help is appreciated!

To know the number of .txt files, you can simply do ls | grep '\.txt$' | wc -l
To know the number of files you changed, you need to either count them while you change the extension, or count the number before and the number after, and subtract them.
This last method can be done like this:
oldnum="$(ls | grep '\.txt$' | wc -l)"
# Do the rename here
newnum="$(ls | grep '\.txt$' | wc -l)"
result=$((newnum - oldnum)) # $result now holds the number of renamed files
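The other option, counting while you rename, could look roughly like this. The function name add_txt_ext is made up, and it assumes the rename was simply "append .txt to every regular file that does not already end in .txt".

```shell
# Hypothetical helper: append ".txt" to every regular file in DIR that does
# not already have that extension, then print how many files were renamed.
add_txt_ext() {
    dir=$1
    count=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue            # skip directories and an unmatched glob
        case $f in
            *.txt) continue ;;             # already has the extension
        esac
        mv -- "$f" "$f.txt" && count=$((count + 1))
    done
    printf '%s\n' "$count"
}

# Usage:
# renamed=$(add_txt_ext .)
```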

I hope you remember roughly when you modified the files.
For example, if you modified them about an hour ago, just run this in the working directory:
find . -maxdepth 1 -type f -name '*\.txt' -cmin -65
This prints all the *.txt files whose status changed (a rename normally counts) less than 65 minutes ago.

How to recursively remove different files in two directories

I have two directory trees: one contains 200 .txt files and the other contains 210. I need a script to find the file names that differ between them and remove those files.

There are probably better ways, but I would try something like:
find directory1 directory2 -name \*.txt -printf '%f\n' |
sort | uniq -u |
xargs -I{} find directory1 directory2 -name {} -delete
find directory1 directory2 -name \*.txt -printf '%f\n':
print basename of each file matching the glob *.txt
sort | uniq -u:
only print unique lines (if you wanted to delete duplicates, it would have been uniq -d)
xargs -I{} find directory1 directory2 -name {} -delete:
remove them (re-specify the path to narrow the search and avoid deleting files outside the initial search path)
Notes
Thanks to @KlausPrinoth for all the suggestions.
Obviously I'm assuming a GNU userland, I suppose people running with the tools providing bare minimum POSIX compatibility will be able to adapt it.
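One caveat with uniq -u is that a name occurring twice inside the same tree (in two subfolders, say) looks like a match. A small sketch that de-duplicates the names per tree first, again assuming GNU find; the function names txt_names and unmatched_names are made up, and it only prints names rather than deleting anything.

```shell
# Hypothetical helpers. txt_names lists each *.txt base name below DIR once;
# unmatched_names prints the names that occur in only one of the two trees.
txt_names() {
    find "$1" -type f -name '*.txt' -printf '%f\n' | sort -u
}

unmatched_names() {
    { txt_names "$1"; txt_names "$2"; } | sort | uniq -u
}

# Usage (dry run): review the list before feeding it to find ... -delete.
# unmatched_names directory1 directory2
```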

Yet another way is to use diff, which is more than capable of finding differences between directories. For instance, if you have d1 and d2 that contain your 200 and 210 files respectively (with the first 200 files being the same), you could use diff and process substitution to provide the names to remove to a while loop:
( while read -r line; do printf "rm %s\n" ${line##*: }; done < <(diff -q d1 d2) )
Output (of d1 with 10 files, d2 with 12 files)
rm file11.txt
rm file12.txt
diff will not fit all circumstances, but it does a great job finding directory differences and is quite flexible.

Create a bash script to delete folders which do not contain a certain filetype

I have recently run into a problem.
I used a utility to move all my music files into directories based on tags. This left a LOT of almost empty folders. The folders, in general, contain a thumbs.db file or some sort of image for album art. The mp3s have the correct album art in their new directories, so the old ones are okay to delete.
Basically, I need to find any directories within D:/Music/ that:
-Do not have any subdirectories
-Do not contain any mp3 files
And then delete them.
I figured this would be easier to do in a shell script or bash script or whatever else linux/unix world than in Windows 8.1 (HAHA).
Any suggestions? I'm not very experienced writing scripts like this.

This should get you started
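find /music -mindepth 1 -type d |
while IFS= read -r dt
do
    # skip directories that contain a subdirectory
    find "$dt" -mindepth 1 -type d | read && continue
    # skip directories that contain an mp3 file
    find "$dt" -iname '*.mp3' -type f | read && continue
    echo "DELETE $dt"
done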
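The same check as a reusable dry-run function might look like this. It is only a sketch: the name leaf_dirs_without_mp3 is invented, it relies on GNU find's -quit, and it assumes no directory name contains a newline. It prints candidates and deletes nothing.

```shell
# Hypothetical helper: print every directory below ROOT that has no
# sub-directory and no *.mp3 file (case-insensitive) directly inside it.
leaf_dirs_without_mp3() {
    find "$1" -mindepth 1 -type d | LC_ALL=C sort |
    while IFS= read -r dir; do
        # Skip directories that contain at least one sub-directory.
        [ -n "$(find "$dir" -mindepth 1 -maxdepth 1 -type d -print -quit)" ] && continue
        # Skip directories that contain at least one mp3 file.
        [ -n "$(find "$dir" -maxdepth 1 -type f -iname '*.mp3' -print -quit)" ] && continue
        printf '%s\n' "$dir"
    done
}

# Usage: review the list first, then delete by hand or with xargs.
# leaf_dirs_without_mp3 /music
```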

Here's the short story...
find . \( -name '*.mp3' -o -type d \) -printf '%h\n' | sort | uniq > non-empty-dirs.tmp
find . -type d -print | sort | uniq > all-dirs.tmp
comm -23 all-dirs.tmp non-empty-dirs.tmp > dirs-to-be-deleted.tmp
less dirs-to-be-deleted.tmp
xargs -d '\n' rm -rf < dirs-to-be-deleted.tmp
Note that you might have to run all the commands a few times (depending on your repository's directory depth) before you're done deleting all recursive empty directories...
And the long story goes...
You can approach this problem from two basic perspectives. In the first, you find all directories, then iterate over each of them and check whether it contains any mp3 file or any subdirectory; if not, you mark that directory for deletion. It will work, but on very large repositories you can expect a significant run time.
Another approach, which is in my opinion much more interesting, is to build a list of directories NOT to be deleted, and subtract that list from the list of all directories. Let's work through the second strategy, one step at a time...
First of all, to find the paths of all directories that contain mp3 files, you can simply do:
find . -name '*.mp3' -printf '%h\n' | sort | uniq
This means "find any file ending with .mp3, then print the path to its parent directory".
Now, I could certainly name at least ten different approaches to find directories that contain at least one subdirectory, but keeping the same strategy as above, we can easily get...
find . -type d -printf '%h\n' | sort | uniq
What this means is: "Find any directory, then print the path to its parent."
Both of these queries can be combined in a single invocation, producing a single list containing the paths of all directories NOT to be deleted. The parentheses matter: without them, -printf would apply only to the -type d branch and the mp3 files would print nothing. Let's redirect that list to a temporary file.
find . \( -name '*.mp3' -o -type d \) -printf '%h\n' | sort | uniq > non-empty-dirs.tmp
Let's similarly produce a file containing the paths of all directories, no matter if they are empty or not.
find . -type d -print | sort | uniq > all-dirs.tmp
So there, we have, on one side, the complete list of all directories, and on the other, the list of directories not to be deleted. What now? There are tons of strategies, but here's a very simple one:
comm -23 all-dirs.tmp non-empty-dirs.tmp > dirs-to-be-deleted.tmp
Once you have that, well, review it, and if you are satisfied, then pipe it through xargs to rm to actually delete the directories.
xargs -d '\n' rm -rf < dirs-to-be-deleted.tmp
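Put together as one function, the list-subtraction strategy could be sketched as below. The name dirs_to_delete is made up, GNU find is assumed, and a few choices are mine rather than the answer's: the root itself is excluded with -mindepth 1, the temporary files live in a mktemp directory, and sort and comm are pinned to the C locale so they agree on ordering. It prints the candidates and leaves the deleting to you.

```shell
# Hypothetical helper: print directories below ROOT that contain neither a
# sub-directory nor an mp3 file, computed as "all dirs" minus "dirs to keep".
dirs_to_delete() {
    root=$1
    tmp=$(mktemp -d) || return 1
    # Directories to keep: the parent of every sub-directory and of every mp3.
    find "$root" -mindepth 1 \( -type d -o -type f -iname '*.mp3' \) -printf '%h\n' |
        LC_ALL=C sort -u > "$tmp/keep"
    # Every directory below the root.
    find "$root" -mindepth 1 -type d | LC_ALL=C sort -u > "$tmp/all"
    # Lines that appear only in the "all" list.
    LC_ALL=C comm -23 "$tmp/all" "$tmp/keep"
    rm -rf "$tmp"
}

# Usage:
# dirs_to_delete . > dirs-to-be-deleted.tmp
```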

Finding and Listing Duplicate Words in a Plain Text file

I have a rather large file that I am trying to make sense of.
I generated a list of my entire directory structure that contains a lot of files using the du -ah command.
The result basically lists all the folders under a specific folder and the consequent files inside the folder in plain text format.
eg:
4.0G ./REEL_02/SCANS/200113/001/Promise Pegasus/BMB 10/RED EPIC DATA/R3D/18-09-12/CAM B/B119_0918NO/B119_0918NO.RDM/B119_C004_0918XJ.RDC/B119_C004_0918XJ_003.R3D
3.1G ./REEL_02/SCANS/200113/001/Promise Pegasus/BMB 10/RED EPIC DATA/R3D/18-09-12/CAM B/B119_0918NO/B119_0918NO.RDM/B119_C004_0918XJ.RDC/B119_C004_0918XJ_004.R3D
15G ./REEL_02/SCANS/200113/001/Promise Pegasus/BMB 10/RED EPIC DATA/R3D/18-09-12/CAM B/B119_0918NO/B119_0918NO.RDM/B119_C004_0918XJ.RDC
Is there any command that I can run or utility that I can use that will help me identify if there is more than one record of the same filename (usually the last 16 characters in each line + extension) and, if such duplicate entries exist, write out the entire path (full line) to a different text file so I can find and move out duplicate files from my NAS, using a script or something?
Please let me know as this is incredibly stressful to do when the plaintext file itself is 5.2Mb :)

Split each line on /, get the last item (cut cannot do it directly, so reverse each line and take the first one), then sort and run uniq with -d, which shows duplicates.
rev FILE | cut -f1 -d/ | rev | sort | uniq -d
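That pipeline prints only the duplicated names. If the full lines are wanted, as the question asks, a two-pass awk over the same file is one way to get them. This is a sketch with a made-up function name (dup_lines); it treats everything after the last / as the file name.

```shell
# Hypothetical helper: print every line of FILE whose last "/"-separated
# field (the file name) occurs on more than one line.
dup_lines() {
    # First pass (NR == FNR) counts the names, second pass prints the matches.
    awk -F/ 'NR == FNR { seen[$NF]++; next } seen[$NF] > 1' "$1" "$1"
}

# Usage (file names are examples only):
# dup_lines du-output.txt > duplicates.txt
```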

I'm not entirely sure what you want to achieve here, but I have the feeling that you are doing it in a difficult way anyway :) Your text file seems to contain spaces in file names, which makes it hard to parse.
I take it that you want to find all files whose name is duplicated. I would start with something like:
find DIR -type f -printf '%f\n' | sort | uniq -d
That means
DIR - look for files in this directory
-type f - print only files (not directories or other special files)
-printf '%f\n' - do not use the default find output format, print only the file name of each file
sort - sort the names, because uniq only compares adjacent lines
uniq -d - print only lines which occur multiple times
You may want to list only some files, not all of them. You can limit which files are taken into account by adding more rules to find. If you care only about *.R3D and *.RDC files you can use
find . \( -name '*.RDC' -o -name '*.R3D' \) -type f -printf '%f\n' | ...
If I wrongly guessed what you need, sorry :)

I think you are looking for fslint: http://www.pixelbeat.org/fslint/
It can find duplicate files, broken links, and stuff like that.

The following will scan the current directory and its subdirectories (using find) and print the full path to duplicate files. You can adapt it to take a different action, e.g. delete/move the duplicate files.
while IFS="|" read -r FNAME LINE; do
    # FNAME contains the filename (without dir), LINE contains the full path
    if [ "$PREV" != "$FNAME" ]; then
        PREV="$FNAME" # new filename found. store
    else
        echo "Duplicate : $LINE" # duplicate filename. Do something with it
    fi
done < <(find . -type f -printf "%f|%p\n" | sort -s -t'|' -k1,1)
To try it out, simply copy paste that into a bash shell or save it as a script.
Note that:
due to the sort, the list of files has to be loaded into memory before the loop begins, so performance will be affected by the number of files returned
the order in which the files appear after the sort affects which files are treated as duplicates, since the first occurrence is assumed to be the original. The -s option ensures a stable sort, which means the order will be dictated by find.
A more straightforward but less robust approach would be something along the lines of:
find . -type f -printf "%20f %p\n" | sort | uniq -D -w20 | cut -c 22-
That will print all files that have duplicate entries, assuming that the longest filename will be 20 characters long. The output differs from the solution above in that all entries with the same name are listed (not N-1 entries as above).
You'll need to change the numbers in the find, uniq and cut commands to match the actual case. A number too small may result in false positives.
find . -type f -printf "%20f %p\n" | sort | uniq -D -w20 | cut -c 22-
find . -type f -printf "%20f %p\n" - find all files in the current dir and subdirs and print out the filename (padded to 20 characters) followed by the full path
sort - sort the output
uniq -D -w20 - print out all entries that have duplicates, but only look at the first 20 chars
cut -c 22- - discard the first 21 chars of each line
