How to list files in a directory, sorted by size, but without listing folder sizes? - linux

I'm writing a bash script that should output the first 10 heaviest files in a directory and all subfolders (the directory is passed to the script as an argument).
And for this I use the following command:
sudo du -haS ../../src/ | sort -hr
but its output also contains folder sizes, and I only need files. Help!

Why use du at all? You could do a
ls -S1AF
This will list all entries in the current directory, sorted by size in descending order. It will also include the names of the subdirectories, but they tend to end up near the bottom (because the reported size of a directory entry is small), and you can recognize them because they have a slash at the end.
To exclude those directories and pick the first 10 lines, you can do a
ls -S1AF | head -n 10 | grep -v '/$'
UPDATE:
If your directory contains not only subdirectories, but also files of length zero, some of those empty files might not be shown in the output, as pointed out in the comment by F.Hauri. If this is an issue for your application, I suggest to exchange the order and do a
ls -S1AF | grep -v '/$' | head -n 10
instead.

Would you please try the following:
dir="../../src/"
sudo find "$dir" -type f -printf "%s\t%p\n" | sort -nr | head -n 10 | cut -f2-
find "$dir" -type f searches $dir for files recursively.
The -printf "%s\t%p\n" option tells find to print the filesize
and the filename delimited by a tab character.
The final cut -f2- in the pipeline prints the 2nd and the following
columns, dropping the filesize column only.
It will work with filenames that contain special characters such as whitespace, but not with filenames containing a newline character.
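Since the question mentions that the directory is passed to the script as an argument, the pipeline above could be wrapped into a small script roughly like this (a sketch only; the script name and the usage message are placeholders, and GNU find/sort are assumed):
#!/bin/bash
# Usage: ./ten-largest.sh <directory>   (hypothetical name)
dir="${1:?usage: $0 <directory>}"
# print size<TAB>path for every regular file, sort numerically, keep the 10 largest, drop the size column
sudo find "$dir" -type f -printf "%s\t%p\n" | sort -nr | head -n 10 | cut -f2-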

Related

Find the directory containing the maximum number of files in the system

How do I write a shell script to print the name of the directory in the system that contains the maximum number of files (including hidden files)?
In order to find the directory containing the most ordinary files, you can do something like the following.
$ find / -type f -print0 | xargs -n 1 --null dirname | sort | uniq -c | sort -n | tail -n 1
It will output the directory with the highest number of ordinary files, starting with the number of files followed by the directory name.
Use of xargs -n 1 is needed for compatibility with older versions of dirname.
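With GNU find you could also skip the dirname/xargs step entirely, since %h prints the directory part of each path directly (a hedged variant of the same idea; like the original, it assumes filenames without newlines):
find / -type f -printf '%h\n' | sort | uniq -c | sort -n | tail -n 1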

Counting Amount of Files in Directory Including Hidden Files with BASH

I want to count the amount of files in the directory I am currently in (including hidden files). So far I have this:
ls -1a | wc -l
but I believe this returns 2 more than what I want because it also counts "." (current directory) and ".." (directory above this one) as files. How would I go about returning the correct amount of files?
I believe that to count all files / directories / hidden files you can also use a BASH array like this:
shopt -s nullglob dotglob
cd /whatever/path
arr=( * )
count="${#arr[#]}"
This also works with filenames that contain spaces or newlines.
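If you want to reuse this without the shopt settings and the cd leaking into your interactive shell, one possible sketch is to put it into a function whose body runs in a subshell (count_entries is just a made-up name):
count_entries() (
    # parentheses: the body runs in a subshell, so shopt and cd do not affect the caller
    shopt -s nullglob dotglob
    cd "$1" || exit 1
    arr=( * )
    echo "${#arr[@]}"
)
count_entries /whatever/path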
Edit:
ls piped to wc is not the right tool for this job, because filenames in UNIX can contain newlines as well, which would lead to such files being counted multiple times.
Following @gniourf_gniourf's comment (thanks!) the following command will handle newlines in file names correctly and should be used:
find -mindepth 1 -maxdepth 1 -printf x | wc -c
The find command lists files in the current directory - including hidden files, excluding the . and .. because of -mindepth 1. It works non-recursively because of -maxdepth 1.
The -printf x action simply prints an x for each file in the directory which leads to an output like this:
xxxxxxxx
Piped to wc -c (-c means counting characters) you get your final result.
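If you only want to count regular files (leaving out subdirectories, symlinks and other entry types), the same command can be narrowed down with -type f:
find -mindepth 1 -maxdepth 1 -type f -printf x | wc -c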
Former Answer:
Use the following command:
ls -1A | wc -l
-a will include all files or directories starting with a dot, but -A will exclude the current folder . and the parent folder ..
I suggest consulting man ls
You almost got it right:
ls -1A | wc -l
If your filenames contain newlines or other funny characters, do:
find -type f -ls | wc -l

Finding and Listing Duplicate Words in a Plain Text file

I have a rather large file that I am trying to make sense of.
I generated a list of my entire directory structure that contains a lot of files using the du -ah command.
The result basically lists all the folders under a specific folder and then the files inside each folder, in plain text format.
eg:
4.0G ./REEL_02/SCANS/200113/001/Promise Pegasus/BMB 10/RED EPIC DATA/R3D/18-09-12/CAM B/B119_0918NO/B119_0918NO.RDM/B119_C004_0918XJ.RDC/B119_C004_0918XJ_003.R3D
3.1G ./REEL_02/SCANS/200113/001/Promise Pegasus/BMB 10/RED EPIC DATA/R3D/18-09-12/CAM B/B119_0918NO/B119_0918NO.RDM/B119_C004_0918XJ.RDC/B119_C004_0918XJ_004.R3D
15G ./REEL_02/SCANS/200113/001/Promise Pegasus/BMB 10/RED EPIC DATA/R3D/18-09-12/CAM B/B119_0918NO/B119_0918NO.RDM/B119_C004_0918XJ.RDC
Is there any command I can run, or utility I can use, that will help me identify whether there is more than one record of the same filename (usually the last 16 characters of each line plus the extension) and, if such duplicate entries exist, write out the entire path (full line) to a different text file, so I can find and move the duplicate files off my NAS using a script or something.
Please let me know as this is incredibly stressful to do when the plaintext file itself is 5.2Mb :)
Split each line on /, get the last item (cut cannot do it, so reverse each line and take the first field), then sort and run uniq with -d, which shows duplicates.
rev FILE | cut -f1 -d/ | rev | sort | uniq -d
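Since the question also asks to write the full lines of the duplicates to a separate file, one possible follow-up (a sketch; FILE, dupnames.txt and duplicate_paths.txt are placeholder names, and plain substring matching may over-match if a duplicated name also occurs inside some other path) is:
rev FILE | cut -f1 -d/ | rev | sort | uniq -d > dupnames.txt
grep -F -f dupnames.txt FILE > duplicate_paths.txt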
I'm not entirely sure what you want to achieve here, but I have the feeling that you are doing it in a difficult way anyway :) Your text file seems to contain spaces in the filenames, which makes it hard to parse.
I take it that you want to find all files whose name is duplicate. I would start with something like:
find DIR -type f -printf '%f\n' | sort | uniq -d
That means:
find DIR - look for files in this directory
-type f - print only files (not directories or other special files)
-printf '%f\n' - do not use the default find output format, print only the file name of each file
sort | uniq -d - sort the names and print only those which occur multiple times (uniq -d only reports adjacent duplicates, so the sort is required)
You may want to list only some files, not all of them. You can limit which files are taken into account by adding more rules to the find command. If you care only about *.R3D and *.RDC files you can use
find . \( -name '*.RDC' -o -name '*.R3D' \) -type f -printf '%f\n' | ...
If I wrongly guessed what you need, sorry :)
I think you are looking for fslint: http://www.pixelbeat.org/fslint/
It can find duplicate files, broken links, and stuff like that.
The following will scan the current subdirectory (using find) and print the full path to duplicate files. You can adapt it take a different action, e.g. delete/move the duplicate files.
while IFS="|" read FNAME LINE; do
# FNAME contains the filename (without dir), LINE contains the full path
if [ "$PREV" != "$FNAME" ]; then
PREV="$FNAME" # new filename found. store
else
echo "Duplicate : $LINE" # duplicate filename. Do something with it
fi
done < <(find . -type f -printf "%f|%p\n" | sort -s)
To try it out, simply copy paste that into a bash shell or save it as a script.
Note that:
due to the sort, the list of files will have to be loaded into memory before the loop begins, so performance will be affected by the number of files returned
the order in which the files appear after the sort will affect which files are treated as duplicates, since the first occurrence is assumed to be the original. The -s option (with the sort key limited to the filename field) ensures a stable sort, which means that for identical filenames the order is dictated by find.
A more straightforward but less robust approach would be something along the lines of:
find . -type f -printf "%20f %p\n" | sort | uniq -D -w20 | cut -c 22-
That will print all files that have duplicate entries, assuming that the longest filename will be 20 characters long. The output differs from the solution above in that all entries with the same name are listed (not N-1 entries as above).
You'll need to change the numbers in the find, uniq and cut commands to match the actual case. A number too small may result in false positives.
find . -type f -printf "%20f %p\n" | sort | uniq -D -w20 | cut -c 22-

find . -type f -printf "%20f %p\n"   Find all files in the current dir and subdirs
                                     and print the filename (padded to 20 characters)
                                     followed by the full path
sort                                 Sort the output
uniq -D -w20                         Print all entries that have duplicates,
                                     but only look at the first 20 chars
cut -c 22-                           Discard the first 21 chars of each line

Counting files contained in a directory

How can I count all files, hidden files, directories, hidden directories, sub-directories, hidden sub-directories and (symbolic) links in a given directory using bash?
find . | wc -l
This will count each symlink as a file. To traverse symlinks, counting their contents, use:
find -L . | wc -l
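Note that find . also counts the starting directory . itself, so the result is one higher than the number of entries below it. If that matters, -mindepth 1 excludes it:
find . -mindepth 1 | wc -l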
find . -print0 | tr -cd '\0' | wc -c
This handles filenames with newline characters.
This does it:
find the_directory | wc -l
This works by finding all entries under the directory and counting them.
You can also use
tree
it gives you a count in the end. I don't know how the speed compares with find. Lazily:
tree | tail -1
easier to type than find :-)
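One caveat, since the question explicitly includes hidden files and directories: tree skips dotfiles by default, so you would want the -a flag:
tree -a | tail -1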

Copy the three newest files under one directory (recursively) to another specified directory

I'm using bash.
Suppose I have a log file directory /var/myprogram/logs/.
Under this directory I have many sub-directories and sub-sub-directories that include different types of log files from my program.
I'd like to find the three newest files (modified most recently), whose name starts with 2010, under /var/myprogram/logs/, regardless of sub-directory and copy them to my home directory.
Here's what I would do manually
1. Go through each directory and do ls -lt 2010*
to see which files starting with 2010 are modified most recently.
2. Once I go through all directories, I'd know which three files are the newest. So I copy them manually to my home directory.
This is pretty tedious, so I wondered if maybe I could somehow pipe some commands together to do this in one step, preferably without using shell scripts?
I've been looking into find, ls, head, and awk that I might be able to use but haven't figured the right way to glue them together.
Let me know if I need to clarify. Thanks.
Here's how you can do it:
find -type f -name '2010*' -printf "%C@\t%P\n" | sort -nr -k1,1 | head -3 | cut -f 2-
This outputs a list of files prefixed by their last change time, sorts them based on that value, takes the top 3 and removes the timestamp.
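To actually copy the three files to the home directory, as the question asks, the resulting paths could be fed into a small read loop. A possible sketch (assuming the log directory from the question and filenames without embedded newlines; %p is used instead of %P so that cp gets a usable path):
find /var/myprogram/logs -type f -name '2010*' -printf "%C@\t%p\n" \
    | sort -nr -k1,1 | head -3 | cut -f2- \
    | while IFS= read -r f; do cp "$f" ~/; done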
Your answers feel very complicated, how about
for FILE in $(find . -type d); do ls -t -1 -F "$FILE" | grep -v "/" | grep "^2010" | head -n3 | xargs -I{} mv {} ~; done;
or laid out nicely
for FILE in $(find . -type d);
do
    ls -t -1 -F "$FILE" | grep -v "/" | grep "^2010" | head -n3 | xargs -I{} mv {} ~;
done;
My "shortest" answer after quickly hacking it up.
for file in $(find . -iname '*.php' -mtime -1 | xargs ls -l | awk '{ print $6" "$7" "$8" "$9 }' | sort | sed -n '1,3p' | awk '{ print $4 }'); do cp "$file" ../; done
The main command stored in $() does the following:
Find all files recursively in current directory matching (case insensitive) the name *.php and having been modified in the last 24 hours.
Pipe to ls -l, required to be able to sort by modification date, so we can have the first three
Extract the modification date and file name/path with awk
Sort these files based on datetime
With sed print only the first 3 files
With awk print only their name/path
Used in a for loop, each file is then copied to the desired location.
Or use @Hasturkun's variant, which popped up as a response while I was editing this post :)
