Finding and Listing Duplicate Words in a Plain Text file - linux

I have a rather large file that I am trying to make sense of.
I generated a list of my entire directory structure that contains a lot of files using the du -ah command.
The result basically lists all the folders under a specific folder, and the files inside each folder, in plain text format.
e.g.:
4.0G ./REEL_02/SCANS/200113/001/Promise Pegasus/BMB 10/RED EPIC DATA/R3D/18-09-12/CAM B/B119_0918NO/B119_0918NO.RDM/B119_C004_0918XJ.RDC/B119_C004_0918XJ_003.R3D
3.1G ./REEL_02/SCANS/200113/001/Promise Pegasus/BMB 10/RED EPIC DATA/R3D/18-09-12/CAM B/B119_0918NO/B119_0918NO.RDM/B119_C004_0918XJ.RDC/B119_C004_0918XJ_004.R3D
15G ./REEL_02/SCANS/200113/001/Promise Pegasus/BMB 10/RED EPIC DATA/R3D/18-09-12/CAM B/B119_0918NO/B119_0918NO.RDM/B119_C004_0918XJ.RDC
Is there any command I can run, or utility I can use, that will identify whether there is more than one record of the same filename (usually the last 16 characters of each line, plus extension)? If such duplicate entries exist, I'd like to write out the entire path (the full line) to a different text file, so I can find and move the duplicate files off my NAS using a script or something.
Please let me know, as this is incredibly stressful to do by hand when the plain text file itself is 5.2 MB :)

Split each line on /, and take the last item (cut cannot take the last field directly, so reverse each line, take the first field, and reverse it back), then sort and run uniq with -d, which shows only duplicates.
rev FILE | cut -f1 -d/ | rev | sort | uniq -d
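To also write out the full path of every duplicate, as asked, here is a minimal two-pass awk sketch (assuming your du listing is in FILE; duplicates.txt is just an illustrative output name):
awk -F/ 'NR==FNR { cnt[$NF]++; next } cnt[$NF] > 1' FILE FILE > duplicates.txt
The first pass counts how often each basename (the last /-separated field) occurs; the second pass prints every full line whose basename occurred more than once.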

I'm not entirely sure what you want to achieve here, but I have the feeling that you are doing it in a difficult way anyway :) Your text file seems to contain file names with spaces, which makes it hard to parse.
I take it that you want to find all files whose name is duplicate. I would start with something like:
find DIR -type f -printf '%f\n' | sort | uniq -d
That means:
DIR - look for files in this directory
-type f - consider only regular files (not directories or other special files)
-printf '%f\n' - do not use find's default output format; print only the file name of each file
sort - uniq only detects duplicates on adjacent lines, so the names must be sorted first
uniq -d - print only lines which occur multiple times
You may want to consider only some files, not all of them. You can limit which files are taken into account by adding more rules to find. If you care only about *.R3D and *.RDC files you can use
find . \( -name '*.RDC' -o -name '*.R3D' \) -type f -printf '%f\n' | ...
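For example, the complete pipeline for just those two extensions would be:
find . \( -name '*.RDC' -o -name '*.R3D' \) -type f -printf '%f\n' | sort | uniq -d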
If I wrongly guessed what you need, sorry :)

I think you are looking for fslint: http://www.pixelbeat.org/fslint/
It can find duplicate files, broken links, and stuff like that.

The following will scan the current subdirectory (using find) and print the full path to duplicate files. You can adapt it to take a different action, e.g. delete/move the duplicate files.
while IFS="|" read FNAME LINE; do
# FNAME contains the filename (without dir), LINE contains the full path
if [ "$PREV" != "$FNAME" ]; then
PREV="$FNAME" # new filename found. store
else
echo "Duplicate : $LINE" # duplicate filename. Do something with it
fi
done < <(find . -type f -printf "%f|%p\n" | sort -s)
To try it out, simply copy paste that into a bash shell or save it as a script.
Note that:
due to the sort, the complete list of files has to be produced and sorted before the loop begins, so performance will depend on the number of files returned
the order in which files appear after the sort affects which files are treated as duplicates, since the first occurrence is assumed to be the original. The -s option ensures a stable sort, so files with the same name keep the order in which find produced them.
A more straightforward but less robust approach would be something along the lines of:
find . -type f -printf "%20f %p\n" | sort | uniq -D -w20 | cut -c 22-
That will print all files that have duplicate entries, assuming that no filename is longer than 20 characters. The output differs from the solution above in that all entries with the same name are listed (not N-1 entries as above).
You'll need to change the numbers in the find, uniq and cut commands to match the actual case. A number too small may result in false positives.
find . -type f -printf "%20f %p\n" | sort | uniq -D -w20 | cut -c 22-
----------------------------------   ----   ------------   ----------
                |                     |          |             |
Find all files in current dir         |     print out all      |
and subdirs and print out             |     entries that       |
the filename (padded to 20            |     have duplicates    |
characters) followed by the           |     but only look at   |
full path                             |     the first 20 chars |
                                      |                        |
                               Sort the output         Discard the first
                                                       21 chars of each line

Related

How to list files in a directory, sorted by size, but without listing folder sizes?

I'm writing a bash script that should output the 10 largest files in a directory and all its subfolders (the directory is passed to the script as an argument).
For this I use the following command:
sudo du -haS ../../src/ | sort -hr
but its output contains folder sizes, and I only need files. Help!
Why use du at all? You could do a
ls -S1AF
This will list all entries in the current directory, sorted descending by size. It will also include the names of the subdirectories, but they will tend to be near the end (a directory entry itself is small), and you can recognize them because -F appends a slash to their names.
To exclude those directories and pick the first 10 lines, you can do a
ls -S1AF | head -n 10 | grep -v '/$'
UPDATE:
If your directory contains not only subdirectories but also files of length zero, some of those empty files might not be shown in the output, as pointed out in the comment by F.Hauri. If this is an issue for your application, I suggest exchanging the order and doing a
ls -S1AF | grep -v '/$' | head -n 10
instead.
Would you please try the following:
dir="../../src/"
sudo find "$dir" -type f -printf "%s\t%p\n" | sort -nr | head -n 10 | cut -f2-
find "$dir" -type f searches $dir for files recursively.
The -printf "%s\t%p\n" option tells find to print the filesize
and the filename delimited by a tab character.
The final cut -f2- in the pipeline prints the 2nd and the following
columns, dropping the filesize column only.
It will work with the filenames which contain special characters such as a whitespace except for
a newline character.
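If the filenames might contain newlines as well, a NUL-delimited variant should work with GNU coreutils (sort -z, head -z and cut -z are GNU extensions); a sketch:
sudo find "$dir" -type f -printf "%s\t%p\0" | sort -znr | head -zn 10 | cut -zf2- | tr '\0' '\n'
The final tr only makes the output readable; drop it if you feed the list to another NUL-aware tool.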

How to use GNU find command to find files by pattern and list files in order of most recent modification to least?

I want to use the GNU find command to find files based on a pattern, and then have them displayed in order of the most recently modified file to the least recently modified.
I understand this:
find / -type f -name '*.md'
but then what would be added to sort the files from the most recently modified to the least?
find can't sort files, so you can instead output the modification time plus filename, sort on modification time, then remove the modification time again:
find . -type f -name '*.md' -printf '%T@ %p\0' | # Print time+name
sort -rnz | # Sort numerically, descending
cut -z -d ' ' -f 2- | # Remove time
tr '\0' '\n' # Optional: make human readable
This uses \0-separated entries to avoid problems with any kind of filenames. You can pass this directly and safely to a number of tools, but here it instead pipes to tr to show the file list as individual lines.
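For example, to run a command on the files in that order rather than printing them, the tr stage could be replaced by a NUL-aware consumer such as xargs -0 (a sketch):
find . -type f -name '*.md' -printf '%T@ %p\0' |
    sort -rnz |
    cut -z -d ' ' -f 2- |
    xargs -0 ls -ld    # act on the files, newest first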
find <dir> -name "*.md" -printf "%Ts - %h/%f\n" | sort -rn
Print the modification time in epoch format (%Ts) as well as the directory (%h) and file name (%f). Pipe this through sort -rn to sort in reverse numeric order.
Pipe the output of find to xargs and ls:
find / -type f -name '*.md' | xargs ls -1t
Note that this breaks on filenames containing whitespace, and with a very long file list xargs will invoke ls more than once, so the ordering is only correct within each batch.
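A NUL-safe variant (still subject to the batching caveat) might be:
find / -type f -name '*.md' -print0 | xargs -0 ls -1t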

Unix - Only list directories which contain a subdirectory

How can I print in the Unix shell the number of directories in a tree which contain other directories?
I haven't found a solution yet with commands like find or ls.
You can use the find command: find . -type d -not -empty
That will print every directory that is not empty (note that this includes directories whose only contents are files). You can control how deep you want the search with -maxdepth.
To print the number, you can use wc -l.
find . -type d -not -empty | wc -l
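For example, limiting the search to two levels below the current directory (an illustrative sketch):
find . -maxdepth 2 -type d -not -empty | wc -l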
If you generate a list of all the directories under a particular directory, and then remove the last component from the name, you have a list of the directories containing subdirectories, but there are likely to be repeats in that list. So, you need to post-process the list, yielding (as a first approximation):
find ${base:-.} -type d |
sed 's%/[^/]*$%%' |
sort -u
Find all the directories under the directory or directories listed in variable $base, defaulting to the current directory, and print their names. The code assumes you don't have directories with a newline in the name. If you do, there are fixes, but the best fix is to rename the directory. The sed command removes the last slash and everything after it. The sort eliminates duplicate entries. What's left is the list of directories containing subdirectories.
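With GNU find, the parent-stripping can also be done by find itself; a shorter sketch of the same idea:
find ${base:-.} -mindepth 1 -type d -printf '%h\n' | sort -u
Here -printf '%h\n' prints each subdirectory's parent, and -mindepth 1 keeps the starting directory itself out of the input, which also sidesteps the degenerate case discussed next.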
Well, more or less. There's the degenerate case to consider: the top-level directories in the list will be listed regardless of whether they have sub-directories or not. Fixing that is a bit harder. You need to eliminate any lines of output that exactly match the directories specified to find before removing trailing material. So, you need something like:
{
    printf '\\#^%s$#d\n' ${base:-.}
    echo 's%/[^/]*$%%'
} > sed.script
find ${base:-.} -type d |
sed -f sed.script |
sort -u
rm -f sed.script
The \\#^%s$#d assumes you don't use # in directory names. If you do use it, then you need to find a character you don't use in names (maybe Control-A) and use that in place of the #. If you could face absolutely any character, then you'll need to do more work escaping some obscure character, such as Control-A, when it appears in a directory name.
There's a problem still: using a fixed name like sed.script for a temporary file name is bad (for multiple reasons — such as two people trying to run the script at the same time in the same directory, though it can also be a security risk), so use mktemp to create a temporary file name:
tmp=$(mktemp ${TMPDIR:-/tmp}/dircnt.XXXXXX)
trap "rm -f $tmp; exit 1" 0 1 2 3 13 15
{
    printf '\\#^%s$#d\n' ${base:-.}
    echo 's%/[^/]*$%%'
} > $tmp
find ${base:-.} -type d |
sed -f $tmp |
sort -u
rm -f $tmp
trap 0
This deals with the most common signals (HUP, INT, QUIT, PIPE, TERM) and removes the temporary file even if one of those arrives.
Clearly, if you want to simply count the number of directories, you can pipe the output from the commands above through wc -l to get the count.
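For instance, the complete count as described would be:
find ${base:-.} -type d | sed -f $tmp | sort -u | wc -l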
ls -1d */*/. | cut -d / -f1 | uniq
This lists every depth-2 path whose second component is a directory (the trailing /. only matches directories), keeps the top-level name with cut, and collapses repeats with uniq. Note that it only looks one level deep.

Create a bash script to delete folders which do not contain a certain filetype

I have recently run into a problem.
I used a utility to move all my music files into directories based on tags. This left a LOT of almost empty folders. The folders, in general, contain a thumbs.db file or some sort of image for album art. The mp3s have the correct album art in their new directories, so the old ones are okay to delete.
Basically, I need to find any directories within D:/Music/ that:
- Do not have any subdirectories
- Do not contain any mp3 files
And then delete them.
I figured this would be easier to do in a shell script or bash script or whatever else in the linux/unix world than in Windows 8.1 (HAHA).
Any suggestions? I'm not very experienced writing scripts like this.
This should get you started
find /music -mindepth 1 -type d |
while read -r dt
do
    # skip "$dt" if it has a subdirectory or contains an mp3 file
    find "$dt" -mindepth 1 -type d | read && continue
    find "$dt" -iname '*.mp3' -type f | read && continue
    echo "DELETE $dt"
done
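Once the output looks right, the echo line can be replaced with an actual (destructive, so test first) removal:
rm -rf -- "$dt"    # instead of the echo line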
Here's the short story...
find . \( -name '*.mp3' -o -type d \) -printf '%h\n' | sort | uniq > non-empty-dirs.tmp
find . -type d -print | sort | uniq > all-dirs.tmp
comm -23 all-dirs.tmp non-empty-dirs.tmp > dirs-to-be-deleted.tmp
less dirs-to-be-deleted.tmp
cat dirs-to-be-deleted.tmp | xargs -d '\n' rm -rf
Note that you might have to run all the commands a few times (depending on your repository's directory depth) before you're done deleting all recursive empty directories...
And the long story goes...
You can approach this problem from two basic perspectives: either you find all directories, then iterate over each of them, check if it contains any mp3 file or any subdirectory, and if not, mark that directory for deletion. It will work, but on very large repositories you might expect a significant run time.
Another approach, which is in my sense much more interesting, is to build a list of directories NOT to be deleted, and subtract that list from the list of all directories. Let's work the second strategy, one step at a time...
First of all, to find the path of all directories that contains mp3 files, you can simply do:
find . -name '*.mp3' -printf '%h\n' | sort | uniq
This means "find any file ending with .mp3, then print the path to it's parent directory".
Now, I could certainly name at least ten different approaches to find directories that contains at least one subdirectory, but keeping the same strategy as above, we can easily get...
find . -type d -printf '%h\n' | sort | uniq
What this means is: "Find any directory, then print the path to its parent."
Both of these queries can be combined in a single invocation, producing a single list containing the paths of all directories NOT to be deleted. Let's redirect that list to a temporary file.
find . \( -name '*.mp3' -o -type d \) -printf '%h\n' | sort | uniq > non-empty-dirs.tmp
Note the parentheses: without them, -printf binds only to -type d, and the parents of the mp3 files would never be printed.
Let's similarly produce a file containing the paths of all directories, no matter if they are empty or not.
find . -type d -print | sort | uniq > all-dirs.tmp
So there, we have, on one side, the complete list of all directories, and on the other, the list of directories not to be deleted. What now? There are tons of strategies, but here's a very simple one:
comm -23 all-dirs.tmp non-empty-dirs.tmp > dirs-to-be-deleted.tmp
Once you have that, well, review it, and if you are satisfied, then pipe it through xargs to rm to actually delete the directories.
cat dirs-to-be-deleted.tmp | xargs -d '\n' rm -rf
The -d '\n' tells xargs to split on newlines only, so directory names containing spaces are handled correctly.

Counting files contained in a directory

How can I count all files, hidden files, directories, hidden directories, sub-directories, hidden sub-directories and (symbolic) links in a given directory using bash?
find . | wc -l
This will count each symlink as a file (note that the count also includes the starting directory . itself). To traverse symlinks, counting their contents, use:
find -L . | wc -l
find . -print0 | tr -cd '\0' | wc -c
This handles filenames with newline characters.
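If you also want the total broken down by entry type, GNU find can print each entry's type letter (f = regular file, d = directory, l = symlink); a sketch:
find . -mindepth 1 -printf '%y\n' | sort | uniq -c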
This does it:
find the_directory | wc -l
This works by finding all entries in the directory and counting them.
You can also use
tree
it gives you a count at the end. I don't know how the speed compares with find. Lazily:
tree | tail -1
easier to type than find :-)
