Bash find command not operating in depth-first-search - linux

So I read that the find command in Bash should operate with DFS, but I don't see it happening.
My path tree:
- tests_ex22
  - first
    - middle
      - story2.txt
    - story1.txt
  - last
    - story3.txt
I run the following command:
find $1 -name "*.$2" -exec grep -wi $3 {} \;
And to my surprise, elements in "middle" are printed before elements in "first".
When find arrives in a new directory, I want it to look at the entries in the current directory before descending into a subdirectory. But I do want it to move in a DFS way.
Why is this happening? How can I solve it? (Of course, I don't have to use find.)

middle is an element of first. It's not processing middle before elements in first; it's processing middle as part of the processing of first's elements.
It sounds like you want find to sort entries and process all non-directory entries before directory entries. There is no such mode, I'm afraid. In general find processes directory entries in the order it finds them, which is fairly arbitrary. If it were to process them in a particular order—say, alphabetical, or files before subdirectories—it would be required to sort entries. find avoids that overhead. It does not sort entries, not even as an option.
This is in contrast to ls, which does indeed sort its output. ls is designed to be more of a human-friendly display tool whereas find is for scripting.
Sort by depth
If you're mainly printing file names you could induce find to print each entry's depth along with its path and then manually sort by depth. Something like this:
find "$1" -name "*.$2" -printf '%d\t%p\n' | sort -V | cut -f 2-
You'll have to adapt this to your use case. It's tricky to fit the grep in here.
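One way to work the grep back in is to do the depth-sort first and then read the paths in order. A rough sketch (it assumes file names without embedded newlines, and uses a plain numeric sort on the depth column):
find "$1" -name "*.$2" -printf '%d\t%p\n' |
    sort -n |
    cut -f 2- |
    while IFS= read -r file; do
        grep -wi "$3" "$file"    # same grep as the original command, one file at a time
    done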
Manual loop
Or you could write a recursive search by hand. Here are some starting points:
breadth-first option in the Linux find utility?
How do I recursively list all directories at a location, breadth-first?

Your example shows find operating in a depth-first manner. If what you actually want is breadth-first traversal, there is a find-compatible tool called bfs:
https://github.com/tavianator/bfs
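bfs accepts the same expression syntax as find, so (assuming it is installed) the command from the question should work nearly unchanged:
# Same expression as before, but directories are traversed breadth-first.
bfs "$1" -name "*.$2" -exec grep -wi "$3" {} \;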

Related

How do I construct a pipe in Linux to sort python files from other files on date and name?

Currently I'm learning about pipes. I want to achieve the following:
Construct a shell pipeline that finds all files with extension “.py” in
your homedir, sort on size and print both size, date and name using 1)
find, tr, awk/cut and 2) ls/sort.
I succeeded in separating the python files from the other files with:
find . -type f -name "*.py"
But I don't know how to proceed to sort on the other criteria. I actually think that what I did is wrong since redirecting statements usually begin with cat ... |
Question: How do I construct a shell pipeline that finds all files with extension .py, sorts them on size and prints size, date and name?
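One possible pipeline for part 1), as a sketch only (it assumes GNU find): print size, modification date and name for every .py file under the home directory, then sort numerically on the size column.
# %s = size in bytes, %TY-%Tm-%Td = modification date; sort -n sorts on the leading size field.
find ~ -type f -name '*.py' -printf '%s\t%TY-%Tm-%Td\t%p\n' | sort -n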

recursively finding size of binary directories in linux

Could you recommend a good utility/Bash function that scans a directory and outputs its size?
I need to find the size of the executables in the binary directories: /usr/bin and /bin in order to find their average and median.
I am not sure whether it's best to use the du command or ls.
What's the easiest and efficient way to find the median and average of a directory in Linux?
PS: It should be recursive as there are several directories inside.
This is a two step process. First find the disk usage of every file and then calculate the values.
For the first du is clearly my favorite.
find /usr/bin -type f -exec du '{}' '+'
This will find every file (-type f) and append ('+') its filename ('{}') to an invocation (-exec) of du.
The result will be a tab separated list of usage (in blocks IIRC) and filename.
Now comes the second part (here for the average). We feed this list into awk and let it sum the values and divide by the number of rows:
{ sum += $1 } END { print "avg: " sum/NR }
The first block is executed for every line and adds the value of the first (tab-separated) column to the variable sum. The second block is prefixed with END, meaning it runs once stdin hits EOF. NR is a special variable holding the number of rows read.
So the finished command looks like:
find /usr/bin -type f -exec du '{}' '+' | awk '{ sum += $1 } END { print "Avg: " sum/NR }'
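The question also asks for the median. Here is a rough sketch of one way to get it (assuming GNU tools): sort the per-file usage numerically and let awk pick the middle value.
# Collect all sizes in order, then take the middle one (or average the two middle ones).
find /usr/bin /bin -type f -exec du '{}' '+' |
    cut -f1 | sort -n |
    awk '{ vals[NR] = $1 }
         END {
             if (NR % 2) print "Median: " vals[(NR + 1) / 2]
             else print "Median: " (vals[NR / 2] + vals[NR / 2 + 1]) / 2
         }'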
Now go read about find, awk, and shell pipelines. Those things will make your life considerably easier when you have to deal with Linux shell work. Also, basic knowledge about line buffering and standard IO streams is helpful.

Getting the latest file in shell with YYYYMMDD_HHMMSS.csv.gz format

I have a set of files in a directory, and I want to get the latest file based on the timestamp in the file name.
For example:
test1_20180823_121545.csv.gz
test2_20180822_191545.csv.gz
test3_20180823_192050.csv.gz
test4_20180823_100510.csv.gz
test4_20180823_191040.csv.gz
From the files above, based on the date and time in their names, my output should be test3_20180823_192050.csv.gz.
Using find and sort:
find /path/to/mydirectory -type f | sort -t_ -k2,3 | tail -1
The options used for the sort command are -t for the delimiter and -k for selecting the keys on which the sort is done.
tail gets the last entry from the sorted list.
If the files also have corresponding modification times (shown by ls -l), then you can list them by modification time in reverse order and get the last one:
ls -1rt | tail -1
But if you cannot rely on this, then you need to write a script (e.g. in Perl). You would read the file list into an array, extract each timestamp into another array, convert the timestamps to epoch time (which is easy to sort), and sort the file list along with the timestamps. Maybe hashes can help with it. Then print the last one.
You can try to write it; if you run into issues, someone here can correct you.
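A rough shell version of that idea, assuming names of the form prefix_YYYYMMDD_HHMMSS.csv.gz: extract the timestamp from each name, sort on it numerically, and keep the newest file.
for f in *.csv.gz; do
    ts=${f%.csv.gz}                     # strip the extension
    ts=${ts#*_}                         # drop the prefix before the first underscore
    printf '%s\t%s\n' "${ts/_/}" "$f"   # 20180823_192050 -> 20180823192050
done | sort -n | tail -1 | cut -f2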

egrep not writing to a file

I am using the following command in order to extract domain names & the full domain extension from a file. Ex: www.abc.yahoo.com, www.efg.yahoo.com.us.
[a-z0-9\-]+\.com(\.[a-z]{2})?' source.txt | sort | uniq | sed -e 's/www.//' > dest.txt
The command writes correctly when I specify a small maximum match count, such as -m 100, after source.txt. The problem is when I don't specify it, or when I specify a huge number. However, I could previously write to files with grep (not egrep) using huge numbers, similar to what I'm trying now, and that was successful. I also checked the last-modified date and time while the command was executing, and it seems no modification is happening to the destination file. What could be the problem?
As I mentioned in your earlier question, it's probably not an issue with egrep, but that your file is too big and sort won't output anything (to uniq) until egrep is done. I suggested that you split the file into manageable chunks using the split command. Something like this:
split -l 10000000 source.txt split_source.
This will split the source.txt file into 10-million-line chunks named split_source.aa, split_source.ab, split_source.ac, etc. You can then run the entire command on each of those files (perhaps changing the redirection at the end to append: >> dest.txt).
The problem here is that you can get duplicates across multiple files, so at the end you may need to run
sort dest.txt | uniq > dest_uniq.txt
Your question is missing information.
That aside, a few thoughts. First, to debug and isolate your problem:
Run egrep <params> | less so you can see what egrep is doing, and eliminate any problem from sort, uniq, or sed (my bet's on sort).
How big is your input? Any chance sort is dying from too much input?
Gonna need to see the full command to make further comments.
Second, to improve your script:
You may want to sort | uniq AFTER sed, otherwise you could end up with duplicates in your result set, AND an unsorted result set. Maybe that's what you want. (A sketch of the reordered pipeline is shown after these suggestions.)
Consider wrapping your regular expressions with "^...$", if it's appropriate to establish beginning of line (^) and end of line ($) anchors. Otherwise you'll be matching portions in the middle of a line.
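The reordered pipeline from the first suggestion might look roughly like this; '<pattern>' stands in for the original expression, which is only partially shown above, and sort -u stands in for sort | uniq:
# '<pattern>' is a placeholder for the original expression; the dot in www. is
# escaped so sed only strips a literal "www." prefix.
egrep '<pattern>' source.txt | sed -e 's/www\.//' | sort -u > dest.txt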

What is the fastest way to find all the file with the same inode?

The only way I know is:
find /home -xdev -samefile file1
But it's really slow. I would like to find a tool like locate.
The real problem comes when you have a lot of files; I suppose the operation is O(n).
There is no mapping from inode to name. The only way is to walk the entire filesystem, which as you pointed out is O(number of files). (Actually, I think it's Θ(number of files).)
I know this is an old question, but many versions of find have an -inum option to match a known inode number easily. You can do this with the following command:
find . -inum 1234
This will still run through all files if allowed to do so, but once you get a match you can always stop it manually; I'm not sure if find has an option to stop after a single match (perhaps with an -exec statement?).
This is much easier than dumping output to a file, sorting etc. and other methods, so should be used when available.
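For what it's worth, GNU find does have a way to stop after the first match: the -quit action.
# Print the first path with this inode and then stop walking the tree (GNU find).
find . -inum 1234 -print -quit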
Here's a way:
Use find -printf "%i:\t%p\n" or similar to create a listing of all files prefixed by inode, and output to a temporary file
Extract the first field - the inode with ':' appended - and sort to bring duplicates together and then restrict to duplicates, using cut -f 1 | sort | uniq -d, and output that to a second temporary file
Use fgrep -f to load the second file as a list of strings to search and search the first temporary file.
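Put together, those three steps might look something like this (the temporary file names and the /home path are just placeholders):
# 1) List every file prefixed by its inode key ("inode:" then a tab then the path).
find /home -xdev -printf '%i:\t%p\n' > /tmp/inode_list
# 2) Keep only the inode keys that occur more than once.
cut -f1 /tmp/inode_list | sort | uniq -d > /tmp/dup_inodes
# 3) Print every path whose inode key is in the duplicate list.
fgrep -f /tmp/dup_inodes /tmp/inode_list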
(When I wrote this, I interpreted the question as finding all files which had duplicate inodes. Of course, one could use the output of the first half of this as a kind of index, from inode to path, much like how locate works.)
On my own machine, I use these kinds of files a lot, and keep them sorted. I also have a text indexer application which can then apply binary search to quickly find all lines that have a common prefix. Such a tool ends up being quite useful for jobs like this.
What I'd typically do is: ls -i <file> to get the inode of that file, and then find /dir -type f -inum <inode value> -mount. (You want the -mount to avoid searching on different file systems, which is probably part of your performance issues.)
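Those two steps can also be combined into a single command; a small sketch with hypothetical paths, using stat -c %i to print just the inode number:
# -mount keeps the search on one filesystem; stat -c %i prints the inode of the reference file.
find /dir -mount -type f -inum "$(stat -c %i /dir/file1)"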
Other than that, I think that's about it.
