Recursively finding the size of binary directories in Linux

Could you recommend a good utility/Bash function that scans a directory and outputs its size?
I need to find the size of the executables in the binary directories /usr/bin and /bin in order to find their average and median.
I am not sure whether it's best to use the du command or ls.
What's the easiest and most efficient way to find the median and average file size in a directory in Linux?
PS: It should be recursive, as there are several directories inside.

This is a two step process. First find the disk usage of every file and then calculate the values.
For the first du is clearly my favorite.
find /usr/bin -type f -exec du '{}' '+'
This will find every regular file (-type f) and append its filename ('{}') to a single invocation (-exec ... '+') of du.
The result will be a tab-separated list of usage (in blocks, IIRC) and filename.
Now comes the second part (here for the average). We feed this list into awk and let it sum the sizes and divide by the number of rows:
{ sum += $1 } END { print "avg: " sum/NR }
The first block is executed for every input line and adds the value of the first (tab-separated) column to the variable sum. The other block is prefixed with END, meaning it is executed once stdin reaches EOF. NR is a special variable holding the number of rows (records) read.
So the finished command looks like:
find /usr/bin -type f -exec du '{}' '+' | awk '{ sum += $1 } END { print "Avg: " sum/NR }'
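The question also asks for the median. A minimal sketch for that (same du output, sorted numerically; for an even count it averages the two middle values):
find /usr/bin -type f -exec du '{}' '+' | cut -f 1 | sort -n |
  awk '{ v[NR] = $1 }
       END { if (NR % 2) print "Median: " v[(NR+1)/2]; else print "Median: " (v[NR/2] + v[NR/2+1]) / 2 }'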
Now go read about find, awk and shell pipelines. Those things will make your life considerably easier when you have to deal with Linux shell work. Also, basic knowledge about line buffering and standard I/O streams is helpful.

Related

Bash find command not operating in depth-first-search

So I read that the find command in Bash should operate with DFS, but I don't see it happening.
My path tree:
- tests_ex22
  - first
    - middle
      - story2.txt
    - story1.txt
  - last
    - story3.txt
I run the following command:
find $1 -name "*.$2" -exec grep -wi $3 {} \;
And to my surprise, elements in "middle" are printed before elements in "first".
When find arrives in a new directory, I want it to look in the current dir before moving to a new dir. But, I do want it to move in a DFS way.
Why is this happening? How can I solve it? (ofc, I don't have to use find).
middle is an element of first. It's not processing middle before elements in first; it's processing middle as part of the processing of first's elements.
It sounds like you want find to sort entries and process all non-directory entries before directory entries. There is no such mode, I'm afraid. In general find processes directory entries in the order it finds them, which is fairly arbitrary. If it were to process them in a particular order—say, alphabetical, or files before subdirectories—it would be required to sort entries. find avoids that overhead. It does not sort entries, not even as an option.
This is in contrast to ls, which does indeed sort its output. ls is designed to be more of a human-friendly display tool whereas find is for scripting.
Sort by depth
If you're mainly printing file names you could induce find to print each entry's depth along with its path and then manually sort by depth. Something like this:
find "$1" -name "*.$2" -printf '%d\t%p\n' | sort -V | cut -f 2-
You'll have to adapt this to your use case. It's tricky to fit the grep in here.
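If you do need the grep, one way (a sketch, assuming file names contain no tabs or newlines) is to feed the depth-sorted list into a read loop:
find "$1" -name "*.$2" -printf '%d\t%p\n' | sort -V | cut -f 2- |
while IFS= read -r file; do
    grep -wi "$3" "$file"
done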
Manual loop
Or you could write a recursive search by hand. Here are some starting points:
breadth-first option in the Linux find utility?
How do I recursively list all directories at a location, breadth-first?
Your example shows find operating in a depth-first manner. If you want a breadth-first traversal, there is a find-compatible tool called bfs that searches breadth-first:
https://github.com/tavianator/bfs
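Since bfs aims to be a drop-in replacement for find, the command from the question would look the same (a sketch, assuming bfs is installed):
bfs "$1" -name "*.$2" -exec grep -wi "$3" {} \;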

How do I construct a pipe in Linux to sort Python files from other files on date and name?

Currently I'm learning about pipes. I want to achieve the following:
Construct a shell pipeline that finds all files with extension “.py” in
your homedir, sort on size and print both size, date and name using 1)
find, tr, awk/cut and 2) ls/sort.
I succeeded in separating the python files from the other files with:
find . -type f -name "*.py"
But I don't know how to proceed to sort on the other criteria. I actually think that what I did is wrong since redirecting statements usually begin with cat ... |
Question: How do I construct a shell pipeline that finds all files with extension .py, sorts them on size and prints size, date and name?
Ter
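A minimal sketch of one such pipeline, assuming GNU find (whose -printf can emit size, date and name directly):
# print size (bytes), modification date and name for every .py file, sorted by size
find ~ -type f -name "*.py" -printf '%s\t%TY-%Tm-%Td\t%p\n' | sort -n -k1,1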

Getting the latest file in shell with YYYYMMDD_HHMMSS.csv.gz format

1) I have a set of files in a directory in shell and I want to get the latest file depending on the timestamp in the file name.
2)For Example:
test1_20180823_121545.csv.gz
test2_20180822_191545.csv.gz
test3_20180823_192050.csv.gz
test4_20180823_100510.csv.gz
test4_20180823_191040.csv.gz
3) From the above files, based on the date and time in their names, my output should be test3_20180823_192050.csv.gz
Using find and sort:
find /path/to/mydirectory -type f | sort -t_ -k2,3 | tail -1
Options for the sort command are -t for the delimiter and -k for selecting the keys on which the sort is done.
tail is there to get the last entry from the sorted list.
If the files also have corresponding modification times (shown by ls -l), then you can list them by modification time in reverse order and get the last one:
ls -1rt | tail -1
But if you cannot rely on this, then you need to write a script (e.g. in Perl). You would read the file list into an array, extract the timestamps into another array, convert the timestamps to epoch time (which is easy to sort), and sort the file list along with the timestamps. Maybe hashes can help with it. Then print the last one.
You can try to write it; if you run into issues, someone here can correct you.
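If you would rather stay in the shell than write Perl, here is a minimal sketch of the same idea (extract the timestamp from the name, sort on it, keep the last entry), assuming the naming pattern shown above:
for f in *.csv.gz; do
    ts=${f#*_}            # strip everything up to and including the first underscore
    ts=${ts%.csv.gz}      # strip the extension, leaving YYYYMMDD_HHMMSS
    printf '%s\t%s\n' "$ts" "$f"
done | sort | tail -1 | cut -f 2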

Bash recursive similarities between directories content

I am looking for a bash command/script that will do the following:
Having two directory structures with different structure and file names
To find all lines in one structure that are the same as a line in some file in the other directory structure
E.g. line 56 "int archiveHex = 0x.." in file1.cpp is the same as line 89 of fileArchive.cpp. Of course the line numbers are not required; at that stage the line content is good enough.
The long story is that I have two projects, both quite big, and I want to see whether anyone has used GPL code from one of the projects in his commercial product. However, the file names and directory structure have been changed, but I see similarities and I am sure they copied something.
I found these two related questions:
How to compare two text files for the same exact text using BASH?
It uses grep, but you have to pass the two files and it cannot work recursively.
Also I found
https://unix.stackexchange.com/questions/1079/output-the-common-lines-similarities-of-two-text-files-the-opposite-of-diff as a way to use diff, but for similarities rather than differences.
And also I found for the recursive part this question
https://askubuntu.com/questions/111495/how-to-diff-multiple-files-across-directories
But anyway I don't know how to combine all of them. How would you do this?
This can be done with a bit of shell and Awk script. Read all the lines of the first directory's files into an array, then for each input line, see if it is a defined key in the array. (I'm filtering out whitespace lines to reduce false positives. Maybe add empty comments to the filter, too.) The array key is the contents of the line and the array key's value is a string which identifies the source file name and line number. We conveniently receive these as colon-separated values from grep -nr:
grep -nrv '^[[:space:]]*$' "$srcdir" |
awk -F : 'NR==FNR { a[substr($0, length($1 ":" $2 ":")+1)] = $1 ":" $2; next }
$0 in a { print FILENAME ":" FNR " matches " a[$0] ":" $0; result=1}
END { exit 1-result }' - $(find "$otherdir" -type f)
The Awk script is fundamentally very simple; NR==FNR is a common idiom which matches the first input file (here, standard input, the pipe from grep) which is where we obtain the values for the array a; for subsequent input files, we trigger if the input line is a key in the array. The associative array type of Awk is ideal here.
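As a minimal illustration of the idiom (with hypothetical files a.txt and b.txt), this prints the lines of b.txt that also occur in a.txt:
awk 'NR==FNR { seen[$0]; next } $0 in seen' a.txt b.txt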
This assumes that you have no file names with colons or newlines in them. It also assumes that the find output is small enough to not trigger an "Argument list too long" error, though if it does, that will be somewhat easier to fix.

What is the fastest way to find all the files with the same inode?

The only way I know is:
find /home -xdev -samefile file1
But it's really slow. I would like to find a tool like locate.
The real problem comes when you have a lot of files; I suppose the operation is O(n).
There is no mapping from inode to name. The only way is to walk the entire filesystem, which as you pointed out is O(number of files). (Actually, I think it's Θ(number of files).)
I know this is an old question, but many versions of find have an -inum option to match a known inode number easily. You can do this with the following command:
find . -inum 1234
This will still run through all files if allowed to do so, but once you get a match you can always stop it manually; I'm not sure if find has an option to stop after a single match (perhaps with an -exec statement?).
This is much easier than dumping output to a file, sorting etc. and other methods, so should be used when available.
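With GNU find there is a -quit action that stops the search; a minimal sketch that exits after the first match:
find . -xdev -inum 1234 -print -quit   # -xdev stays on one filesystem; -quit stops after the first match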
Here's a way (put together in the sketch below):
- Use find -printf '%i:\t%p\n' or similar to create a listing of all files prefixed by inode, and output it to a temporary file.
- Extract the first field - the inode with ':' appended - and sort to bring duplicates together, then restrict to duplicates, using cut -f 1 | sort | uniq -d, and output that to a second temporary file.
- Use fgrep -f to load the second file as a list of strings to search for, and search the first temporary file.
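Putting those steps together (the temporary file names are just illustrative):
find . -printf '%i:\t%p\n' > /tmp/files-by-inode.txt
cut -f 1 /tmp/files-by-inode.txt | sort | uniq -d > /tmp/duplicate-inodes.txt
fgrep -f /tmp/duplicate-inodes.txt /tmp/files-by-inode.txt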
(When I wrote this, I interpreted the question as finding all files which had duplicate inodes. Of course, one could use the output of the first half of this as a kind of index, from inode to path, much like how locate works.)
On my own machine, I use these kinds of files a lot, and keep them sorted. I also have a text indexer application which can then apply binary search to quickly find all lines that have a common prefix. Such a tool ends up being quite useful for jobs like this.
What I'd typically do is: ls -i <file> to get the inode of that file, and then find /dir -type f -inum <inode value> -mount. (You want the -mount to avoid searching on different file systems, which is probably part of your performance issues.)
Other than that, I think that's about it.
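As a concrete sketch of that approach (the paths are placeholders):
inode=$(ls -i /path/to/file | awk '{ print $1 }')   # first column of ls -i is the inode number
find /dir -type f -inum "$inode" -mount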
