Listing the results of the du command in alphabetical order - linux

How can I list the results of the du command in alphabetical order?
I know I can use the find command to list them alphabetically, but then I don't get the directory sizes. I also use the -maxdepth/--max-depth option for both commands so that the listing only goes one subdirectory level deep.
Here's the assignment question:
Write a shell script that implements a directory size analyzer. In your script you may use common Linux commands. The script should list the disk storage occupied by each immediate subdirectory of a given argument or the current directory (if no argument is given) with the subdirectory names sorted alphabetically. Also, list the name of the subdirectory with the highest disk usage along with its storage size. If more than one subdirectory has the same highest disk usage, list any one of those subdirectories. Include meaningful brief comments. List of bash commands applicable for this script includes the following but not limited: cat, cut, du, echo, exit, for, head, if, ls, rm, sort, tail, wc. You may use bash variables as well as temporary files to hold intermediate results. Delete all temporary files at the end of the execution.
Here is my result after entering du $dir -hk --max-depth=2 | sort -o temp1.txt then cat temp1.txt in the command line:
12 ./IT_PLAN/Inter_Disciplinary
28 ./IT_PLAN
3 ./IT_PLAN/Core_Courses
3 ./IT_PLAN/Pre_reqs
81 .
9 ./IT_PLAN/IT_Electives
It should look like this:
28 ./IT_PLAN
3 ./IT_PLAN/Core_Courses
12 ./IT_PLAN/Inter_Disciplinary
9 ./IT_PLAN/IT_Electives
The subdirectory with the maximum disk space use:
28 ./IT_PLAN
Once again, I'm having trouble sorting the results alphabetically.

Try doing this:
du $dir -hk --max-depth=2 | sort -k2
Here -k2 tells sort to use column 2 (the path) as the sort key.
See http://www.manpagez.com/man/1/sort/

du $dir -hk --max-depth=2 | awk '{print $2"\t"$1}' | sort -d -k1 -o temp1.txt
and if you want to strip the leading ./ from the paths:
du $dir -hk --max-depth=2 | awk '{print $2"\t"$1}' | sed -e 's/\.\///g' | sort -d -k1 -o temp1.txt
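For the full assignment script, here is a minimal sketch of how these pieces could fit together, assuming GNU du and sort; the temporary file names and the use of --max-depth=1 (the assignment asks only for immediate subdirectories) are my own choices, not requirements:
#!/bin/bash
# Directory size analyzer (sketch). Usage: ./sizer.sh [directory]
dir="${1:-.}"                      # use the argument, or the current directory

# Disk usage in KB of the directory and each immediate subdirectory,
# sorted alphabetically by name (the parent itself sorts first)
du -k --max-depth=1 "$dir" | sort -k2 > temp1.txt

# Drop the first line (the parent directory) to keep only subdirectories
tail -n +2 temp1.txt > temp2.txt

echo "Disk usage per subdirectory (alphabetical):"
cat temp2.txt

echo "The subdirectory with the maximum disk space use:"
sort -n temp2.txt | tail -1        # largest usage; ties resolved arbitrarily

# Delete all temporary files
rm -f temp1.txt temp2.txt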

Related

Calculate the total size of all files from a generated folders list with full PATH

I have a list containing multiple directories with the full PATH:
/mnt/directory_1/sub_directory_1/
/mnt/directory_2/
/mnt/directory_3/sub_directory_3/other_directories_3/
I need to calculate the total size of this list.
From Get total size of a list of files in UNIX
du -ch $file_list | tail -1 | cut -f 1
This was the closest answer I could find, but it gave me the following error message:
bash: /bin/du: Argument list too long
Do not use backticks (`...`). Use $(..) instead.
Do not use:
command $(cat something)
This is a common anti-pattern: it works for simple cases but fails for many more, because the result of $(...) undergoes word splitting and filename expansion.
Check your scripts with http://shellcheck.net
If you want to "run a command with argument from a file" use xargs or write a loop. Read https://mywiki.wooledge.org/BashFAQ/001 . Also xargs will handle too many arguments by itself. And I would also add -s to du. Try:
xargs -d'\n' du -sch < file_list.txt | tail -1 | cut -f 1
Tested on an online bash REPL.
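If your du is from GNU coreutils, another way to stay under the argument-list limit is to let du read the list itself via --files0-from (a sketch; it assumes file_list.txt holds one path per line with no embedded newlines):
tr '\n' '\0' < file_list.txt | du -sch --files0-from=- | tail -1 | cut -f 1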

Unix: Sort 'ls' by return value of program

How can I use a program as a sort key in a Unix shell? In other words, how do I sort the output of 'ls' (or any other program) by the return value of a program applied to each line?
I'll give two example solutions:
A one-line command that is simpler and therefore something I'd try to use first.
A bash script that allows sorting a list by output from an arbitrary bash function that reads each line of the list as input.
Example 1 (without executing command on each line)
If the question is how to, in general, sort outputs of programs like ls, below is an example specific to ls that sorts by inode. However, every program may have its own idiosyncrasies when generating its output so this example may have to be adapted:
ls -ail /home/user/ | tail -n+2 | tr -s ' ' | sort -t' ' -k1,1 -g
Here are the different parts of this command broken down:
ls -ail /home/user/
Lists all (-a) files in directory /home/user/ in list (-l) format with inode (-i).
tail -n+2
Skips the first line of the ls output (the "total" line).
tr -s ' '
Squeezes (-s) runs of spaces (' ') into single spaces for sort.
sort -t' ' -k1,1 -g
Sorts the list numerically (-g) on the first field (-k1,1), using a space as the field separator.
Example 2 (executing command with each line as input)
Here is a more adaptable example: a bash script I worked up that feeds the list of files generated by ls -a1 into a bash function, getinode, which uses stat to output the inode of each file. A while loop repeats this for every file, appending the inode and file name in comma-delimited form to a variable named OUTPUT, which is sorted at the end by sort on the first field.
The important part is that the function getinode can be anything, so long as it outputs a string. I set up getinode to receive a file path as input (first argument $1) and to then output the inode to stdout via echo $INODE. The script calls getinode via $(getinode "$FILEPATH").
#!/bin/bash
# Usage: lsinodesort.sh [file]
# Refs/attrib:
# [1]: How to sort a csv file by sorting on a single field. https://stackoverflow.com/a/44744800
# [2]: How to read a while loop variable. https://stackoverflow.com/a/16854326
WORKDIR="$1" # read directory from first argument
getinode() {
    # Usage: getinode [path]
    INODE="$(stat "$1" --format=%i)"
    echo $INODE
}
if [ -d "$WORKDIR" ]; then
    LINES="$(ls -a1 "$WORKDIR")" # save `ls` output to variable LINES
else
    exit 1; # not a valid directory
fi
while read line; do
    path="$WORKDIR"/"$line" # Determine path.
    if [ -f "$path" ]; then # Check if path is a file.
        FILEPATH="$path"
        FILENAME="$(basename "$path")" # Determine filename from path.
        FILEINODE=$(getinode "$FILEPATH") # Get inode.
        OUTPUT="$FILEINODE"",""$FILENAME""\n""$OUTPUT" ; # Append inode and file name to OUTPUT
    fi
done <<< "$LINES" # See [2].
OUTPUT=$(printf "${OUTPUT}" | sort -t, -k1,1) # sort OUTPUT. See [1]
OUTPUT="inode","filename""\n""$OUTPUT"
printf "${OUTPUT}\n" # print final OUTPUT.
When I run it on my own home folder I get output like this:
inode,filename
3932162,.bashrc
3932165,.bash_logout
3932382,.zshrc
3932454,.gitconfig
3933234,.bash_aliases
3933512,.profile
3933612,.viminfo
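A shorter variant of the same idea is the classic decorate-sort-undecorate pattern: prefix each line with the key produced by the program, sort on that key, then strip it. A sketch using the same stat-based inode key (the directory path is only an example):
ls -a1 /home/user | while IFS= read -r name; do
    # decorate: print "<inode><TAB><name>" for each entry
    printf '%s\t%s\n' "$(stat --format=%i "/home/user/$name")" "$name"
done | sort -n -k1,1 | cut -f2-     # sort on the key, then strip it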
I'm not sure I understand your question, so I'll try to rephrase it first.
If I'm not mistaken, you want to sort the output of a program (it may be ls or any other command in a Unix shell).
I'll suggest using the pipeline feature available on Unix shell.
For instance, you can sort the output of the ls command using:
ls /home | sort
This feature is available for, but not limited to, the ls command.
By the way, there are optional flags you can use to sort the ls results themselves, if that's your specific use case:
ls -S # for sorting by file size
ls -t # for sorting by modification time
You can also append the --reverse or -r flag for displaying the result in reverse order.
As for the sort command, there are also flags that let you customize the result to your needs:
sort -n # for sorting numerically instead of alphabetically
sort -k5 # for sorting based on the 5th column
sort -t "," # for using the comma as a field separator
You can combine them, for example, to sort the output of the 'ls -l' command by fields 2 and 5 (numeric) and 9 (non-numeric/alphabetical):
ls -l /home/$USER | sort -k2,2n -k5,5n -k9,9
See the sort man page for more examples.

List file using ls to find meet the condition

I am writing a batch program to delete all files in a directory based on a condition in the filename.
The directory contains a very large number of text files (hundreds of thousands), with filenames fixed as "abc" + date:
abc_20180820.txt
abc_20180821.txt
abc_20180822.txt
abc_20180823.txt
abc_20180824.txt
The program greps all the files, compares each filename's date to a fixed date, and deletes the file if the filename's date < fixed date.
The problem is that it takes very long to handle that many files (~1 hour to delete 300k files).
My question: Is there a way to compare the date while running the ls command? That is, not to get all files into a list and then compare and delete, but to list only the files that already meet the condition and then delete them. I think that would perform better.
My code is
TARGET_DATE = "5-12"
DEL_DATE = "20180823"
ls -t | grep "[0-9]\{8\}".txt\$ > ${LIST}
for EACH_FILE in `cat ${LIST}` ;
do
DATE=`echo ${EACH_FILE} | cut -c${TARGET_DATE }`
COMPARE=`expr "${DATE}" \< "${DEL_DATE}"`
if [ $COMPARE -eq 1 ] ;
then
rm -f ${EACH_FILE}
fi
done
I found some similar problems, but I don't know how to apply them:
List file using ls with a condition and process/grep files that only whitespaces
Here is a refactoring which gets rid of the pesky ls. Looping over a large directory is still going to be somewhat slow.
# Use lowercase for private variables
# to avoid clobbering a reserved system variable
# You can't have spaces around the equals sign
del_date="20180823"
# No need for ls here
# No need for a temporary file
for filename in *[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].txt
do
    # Avoid external process; use the shell's parameter substitution
    date=${filename%.txt}
    # This could fail if the file name contains literal shell metacharacters!
    date=${date#${date%????????}}   # keep only the last 8 characters (YYYYMMDD)
    # Avoid expr
    if [ "$date" -lt "$del_date" ]; then
        # Just print the file name, null-terminated for xargs
        printf '%s\0' "$filename"
    fi
done |
# For efficiency, do batch delete
xargs -r0 rm
The wildcard expansion will still take a fair amount of time because the shell will sort the list of filenames. A better solution is probably to refactor this into a find command which avoids the sorting.
find . -maxdepth 1 -type f \( \
-name '*1[89][0-9][0-9][0-9][0-9][0-9][0-9].txt' \
-o -name '*201[0-7][0-9][0-9][0-9][0-9].txt' \
-o -name '*20180[1-7][0-9][0-9].txt' \
-o -name '*201808[01][0-9].txt' \
-o -name '*2018082[0-2].txt' \
\) -delete
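Since -delete is irreversible, it is worth previewing the matches first; the same command with -print in place of -delete only lists what would be removed (a sketch, assuming GNU find):
find . -maxdepth 1 -type f \( \
    -name '*1[89][0-9][0-9][0-9][0-9][0-9][0-9].txt' \
    -o -name '*201[0-7][0-9][0-9][0-9][0-9].txt' \
    -o -name '*20180[1-7][0-9][0-9].txt' \
    -o -name '*201808[01][0-9].txt' \
    -o -name '*2018082[0-2].txt' \
\) -print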
You could do something like:
rm abc_201[0-7]*.txt # remove all files from 2010-2017
rm abc_20180[1-4]*.txt # remove all files from Jan-Apr 2018
# And so on
...
to remove a large number of files. Then your code would run faster.
Yes, it takes a lot of time if you have that many files in one folder.
It is a bad idea to keep so many files in one folder. Even a simple ls or find will hammer the storage, and any script that iterates over the files certainly will.
So after you wait an hour for the cleanup, take the time to build a better folder structure. It is a good idea to sort files into folders by year/month/day, and possibly by hour,
e.g.
somefolder/2018/08/24/...files here
Then you can easily delete, move, or compress a whole month or year.
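As a rough sketch of such a one-off reorganisation (the abc_YYYYMMDD.txt names match the question; somefolder is hypothetical):
for f in abc_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].txt; do
    [ -e "$f" ] || continue              # nothing matched the glob
    d=${f%.txt}                          # abc_20180824
    d=${d#abc_}                          # 20180824
    dest="somefolder/${d:0:4}/${d:4:2}/${d:6:2}"
    mkdir -p "$dest" && mv -- "$f" "$dest/"
done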
I found a solution in this thread.
https://unix.stackexchange.com/questions/199554/get-files-with-a-name-containing-a-date-value-less-than-or-equal-to-a-given-inpu
The awk command is so powerful that it only takes ~1 minute to deal with hundreds of thousands of files (about 1/10 of the loop's time).
ls | awk -v date="$DEL_DATE" '$0 <= date' | xargs rm -vrf
I can even count, copy, or move files with the same approach; it's the fastest answer I've seen. For example, counting while deleting:
COUNT="$(ls | awk -v date="${DEL_DATE}" '$0 <= date' | xargs rm -vrf | wc -l)"

Clearing archive files with linux bash script

Here is my problem,
I have a folder where is stored multiple files with a specific format:
Name_of_file.TypeMM-DD-YYYY-HH:MM
where MM-DD-YYYY-HH:MM is the time of its creation. There could be multiple files with the same name but not the same time of course.
What I want is a script that keeps the 3 newest versions of each file.
So, I found one example there:
Deleting oldest files with shell
But I don't want to delete a fixed number of files; I want to keep a certain number of the newer ones. Is there a way to get that find command to parse out the Name_of_file and keep the 3 newest?
Here is the code I've tried so far, but it's not exactly what I need.
find /the/folder -type f -name 'Name_of_file.Type*' -mtime +3 -delete
Thanks for help!
So I decided to add my final solution in case anyone would like it. It's a combination of the two solutions given.
ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}" | awk 'NR > 3' | xargs rm
One line, super efficient. If the date or name pattern ever changes, just change the grep -P pattern to match it. This way you are sure that only files fitting the pattern get deleted.
Can you be extra, extra sure that the timestamp on the file is the exact same timestamp on the file name? If they're off a bit, do you care?
The ls command can sort files by timestamp order. You could do something like this:
$ ls -t | awk 'NR > 3' | xargs rm
The ls -t lists the files by modification time, newest first.
The awk 'NR > 3' prints the list of files except for the first three lines, which are the three newest.
The xargs rm will remove the files that are older than the first three.
Now, this isn't the exact solution. There are possible problems with xargs because file names might contain weird characters or whitespace. If you can guarantee that's not the case, this should be okay.
Also, you probably want to group the files by name, and keep the last three. Hmm...
ls | sed 's/[0-9][0-9]-[0-9][0-9]-[0-9]\{4\}-[0-9][0-9]:[0-9][0-9]$//' | sort -u | while read file
do
ls -t $file* | awk 'NR > 3' | xargs rm
done
The ls will list all of the files in the directory. The sed will strip the MM-DD-YYYY-HH:MM date-time stamp from the file names. The sort -u will make sure you only have the unique base names. Thus
file1.txt-01-12-1950
file2.txt-02-12-1978
file2.txt-03-12-1991
Will be reduced to just:
file1.txt
file2.txt
These are fed through the loop, and the ls -t $file* will list all of the files that start with that base name, newest first; awk drops the first three (the newest) from the list, and xargs rm deletes the rest, leaving only the three newest of each.
Assuming we're using the date in the filename to date the archive file, and that it is possible to change the date format to YYYY-MM-DD-HH:MM (as established in the comments above), here's a quick and dirty shell script to keep the newest 3 versions of each file within the present working directory:
#!/bin/bash
KEEP=3 # number of versions to keep
while read FNAME; do
    NODATE=${FNAME:0:-16} # get filename without the date (remove last 16 chars)
    if [ "$NODATE" != "$LASTSEEN" ]; then # new file found
        FOUND=1; LASTSEEN="$NODATE"
    else # same file, different date
        let FOUND="FOUND + 1"
        if [ $FOUND -gt $KEEP ]; then
            echo "- Deleting older file: $FNAME"
            rm "$FNAME"
        fi
    fi
done < <(\ls -r | grep -P "(.+)\d{4}-\d{2}-\d{2}-\d{2}:\d{2}")
Example run:
[me#home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2011-12-12-12:11
some_file.exe2012-01-11-23:11
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
[me#home]$ ./delete_old.sh
- Deleting older file: some_file.exe2012-01-11-23:11
- Deleting older file: some_file.exe2011-12-12-12:11
[me#home]$ ls
another_file.txt2011-02-11-08:05
another_file.txt2012-12-09-23:13
delete_old.sh
not_an_archive.jpg
some_file.exe2012-12-10-00:11
some_file.exe2013-03-01-23:11
some_file.exe2013-03-01-23:12
Essentially, by changing the date in the file name to the form YYYY-MM-DD-HH:MM, a normal string sort (such as that done by ls) will automatically group similar files together, sorted by date-time.
The ls -r on the last line lists all files within the current working directory and prints the results in reverse order, so newer archive files appear first.
We pass the output through grep to extract only files that are in the correct format.
The output of that command combination is then looped over (see the while loop), and we can simply start deleting after 3 occurrences of the same filename (minus the date portion).
This pipeline will get you the 3 newest files (by modification time) in the current dir
stat -c $'%Y\t%n' file* | sort -n | tail -3 | cut -f 2-
To get all but the 3 newest:
stat -c $'%Y\t%n' file* | sort -rn | tail -n +4 | cut -f 2-
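If the file names are guaranteed to be free of newlines, that second pipeline can feed the deletion directly; a sketch using GNU xargs (not part of the original answer):
stat -c $'%Y\t%n' file* | sort -rn | tail -n +4 | cut -f 2- | xargs -d '\n' rm --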

How to tell how many files match description with * in unix

Pretty simple question: say I have a set of files:
a1.txt
a2.txt
a3.txt
b1.txt
And I use the following command:
ls a*.txt
It will return:
a1.txt a2.txt a3.txt
Is there a way in a bash script to tell how many results will be returned when using the * pattern? In the above example, if I were to use a*.txt the answer should be 3, and if I used *1.txt the answer should be 2.
Comment on using ls:
I see all the other answers attempt this by parsing the output of
ls. This is very unpredictable because this breaks when you have
file names with "unusual characters" (e.g. spaces).
Another pitfall is that it is ls-implementation dependent: a particular implementation might format its output differently.
There is a very nice discussion on the pitfalls of parsing ls output on the bash wiki maintained by Greg Wooledge.
Solution using bash arrays
For the above reasons, using bash syntax would be the more reliable option. You can use a glob to populate a bash array with all the matching file names. Then you can ask bash the length of the array to get the number of matches. The following snippet should work.
files=(a*.txt) && echo "${#files[@]}"
To save the number of matches in a variable, you can do:
files=(a*.txt)
count="${#files[@]}"
One more advantage of this method is you now also have the matching files in an array which you can iterate over.
Note: Although I keep saying bash above, the same approach works in other shells that support arrays, such as ksh and zsh; plain POSIX sh has no arrays.
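For example, a small sketch that both counts and iterates (nullglob is optional, but without it an unmatched pattern would be counted as one bogus entry):
shopt -s nullglob            # an unmatched glob expands to nothing instead of itself
files=(a*.txt)
echo "${#files[@]} file(s) match"
for f in "${files[@]}"; do
    printf 'matched: %s\n' "$f"
done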
You can't know ahead of time, but you can count how many results are returned. I.e.
ls -l *.txt | wc -l
ls -l will display the directory entries matching the specified wildcard, wc -l will give you the count.
You can save the value of this command in a shell variable with either
num=$(ls -l *.txt | wc -l)
or
num=`ls -l *.txt | wc -l`
and then use $num to access it. The first form is preferred.
You can use ls in combination with wc:
ls a*.txt | wc -l
The ls command lists the matching files one per line, and wc -l counts the number of lines.
I like suvayu's answer, but there's no need to use an array:
count() { echo $#; }
count *
In order to count files that might have unpredictable names, e.g. containing new-lines, non-printable characters etc., I would use the -print0 option of find and awk with RS='\0':
num=$(find . -maxdepth 1 -print0 | awk -v RS='\0' 'END { print NR }')
Adjust the options to find to refine the count, e.g. if the criterion is files starting with a lower-case a and having a .txt extension in the current directory, use:
find . -maxdepth 1 -type f -name 'a*.txt' -print0
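Putting those two pieces together, the refined count could look like this (a sketch):
num=$(find . -maxdepth 1 -type f -name 'a*.txt' -print0 | awk -v RS='\0' 'END { print NR }')
echo "$num files match"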

Resources