Finding human-readable files on Unix


I'd like to find human-readable files on my Linux machine without a file extension constraint. The files I mean are ones a person can read, such as text, configuration, HTML, and source-code files. Is there a way to filter for and locate them?

Use:
find /dir/to/search -type f | xargs file | grep text
find will give you a list of files.
xargs file will run the file command on each of the lines from the piped input.

find and file are your friends here:
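find /dir/to/search -type f -exec sh -c 'file -b "$1" | grep -q text' sh {} \; -print
This finds regular files only (NOTE: it will not find symlinks, directories, sockets, etc.) in /dir/to/search and runs sh -c 'file -b "$1" | grep -q text' on each one, which asks file for the type description and looks for the word text in it. If that returns true (i.e., text is in the description), find prints the filename. The original form of this command embedded {} in the script and silenced grep with &>/dev/null; that relies on bash's &> redirection, which a POSIX sh such as dash does not support, and it breaks on filenames containing spaces or quotes.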
NOTE: using the -b flag to file means that the filename is not printed and therefore cannot create any issues with the grep. E.g., without the -b flag the binary file gettext would erroneously be detected as a textfile.
For example,
root@osdevel-pete# find /bin -exec sh -c 'file -b {} | grep text &>/dev/null' \; -print
/bin/gunzip
/bin/svnshell.sh
/bin/unicode_stop
/bin/unicode_start
/bin/zcat
/bin/redhat_lsb_init
root@osdevel-pete# find /bin -type f -name *text*
/bin/gettext
If you want to look in compressed files use the --uncompress flag to file. For more information and flags to file see man file.
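As a variation on the commands above, here is a minimal sketch that asks file for a MIME type instead of grepping the free-form description. The function name list_text_files is made up for illustration, and the rule "MIME type starts with text/" is an assumption about what counts as human-readable; some readable formats (JSON, for example) may be reported as application/... and would be missed.

```shell
# list_text_files DIR
# Hypothetical helper: print every regular file under DIR (default: .)
# whose MIME type, as guessed by file(1), starts with "text/".
list_text_files() {
    find "${1:-.}" -type f -exec sh -c '
        for f do
            # -b drops the filename from the output, so a file that is
            # merely *named* "gettext" cannot match by accident.
            case $(file -b --mime-type "$f") in
                text/*) printf "%s\n" "$f" ;;
            esac
        done' sh {} +
}
```

Passing the paths as arguments to sh (rather than embedding {} in the script) keeps names with spaces or quotes intact, and {} + runs one sh per batch of files instead of one per file.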

This should work fine, too (in bash, because of the <<< here-string):
# Read the file's type description; for a human-readable file it should contain "ASCII" or "Unicode"
file_info=$(file -b "$file_name")
if grep -q -i -e "ASCII" -e "Unicode" <<< "$file_info"; then
    echo "file is human-readable"
fi

Related

Find all directories containing a file that contains a keyword in linux

In my hierarchy of directories I have many text files called STATUS.txt. These text files each contain one keyword such as COMPLETE, WAITING, FUTURE or OPEN. I wish to execute a shell command of the following form:
./mycommand OPEN
which will list all the directories that contain a file called STATUS.txt, where this file contains the text "OPEN"
In future I will want to extend this script so that the directories returned are sorted. Sorting will be determined by a numeric value stored in the file PRIORITY.txt, which lives in the same directory as STATUS.txt. However, this can wait until my competence level improves. For the time being I am happy to list the directories in any order.
I have searched Stack Overflow for the following, but to no avail:
unix filter by file contents
linux filter by file contents
shell traverse directory file contents
bash traverse directory file contents
shell traverse directory find
bash traverse directory find
linux file contents directory
unix file contents directory
linux find name contents
unix find name contents
shell read file show directory
bash read file show directory
bash directory search
shell directory search
I have tried the following shell commands:
This helps me identify all the directories that contain STATUS.txt
$ find ./ -name STATUS.txt
This reads STATUS.txt for every directory that contains it
$ find ./ -name STATUS.txt | xargs -I{} cat {}
This doesn't return any text, I was hoping it would return the name of each directory
$ find . -type d | while read d; do if [ -f STATUS.txt ]; then echo "${d}"; fi; done

... or the other way around:
find . -name "STATUS.txt" -exec grep -lF "OPEN" \{} +
If you want to wrap that in a script, a good starting point might be:
#!/bin/sh
[ $# -ne 1 ] && echo "One argument required" >&2 && exit 2
find . -name "STATUS.txt" -exec grep -lF "$1" \{} +
As pointed out by @BroSlow, if you are looking for directories containing the matching STATUS.txt files, this might be more what you are looking for:
fgrep --include='STATUS.txt' -rl 'OPEN' | xargs -L 1 dirname
Or better
fgrep --include='STATUS.txt' -rl 'OPEN' |
sed -e 's|^[^/]*$|./&|' -e 's|/[^/]*$||'
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# simulate `xargs -L 1 dirname` using `sed`
# (no trailing `\`; returns `.` for path without dir part)

Maybe you can try this:
grep -rl "OPEN" . --include='STATUS.txt'| sed 's/STATUS.txt//'
where grep -r means recursive, -l means list only the names of matching files, and '.' is the directory to search. Piping the result to sed removes the file name, leaving the directory.
You can then wrap this in a bash script file where you can pass in keywords such as 'OPEN', 'FUTURE' as an argument.
#!/bin/bash
grep -rl "$1" . --include='STATUS.txt'| sed 's/STATUS.txt//'

Try something like this
find -type f -name "STATUS.txt" -exec grep -q "OPEN" {} \; -exec dirname {} \;
or in a script
#!/bin/bash
(($#==1)) || { echo "Usage: $0 <pattern>" && exit 1; }
find -type f -name "STATUS.txt" -exec grep -q "$1" {} \; -exec dirname {} \;

You could use grep and awk instead of find:
grep -r OPEN * | awk '{split($1, path, ":"); print path[1]}' | xargs -I{} dirname {}
The above grep will recursively list all files containing "OPEN" inside your directory structure. The result will be something like:
dir_1/subdir_1/STATUS.txt:OPEN
dir_2/subdir_2/STATUS.txt:OPEN
dir_2/subdir_3/STATUS.txt:OPEN
Then the awk script will split this output at the colon and print the first part of it (the dir path).
dir_1/subdir_1/STATUS.txt
dir_2/subdir_2/STATUS.txt
dir_2/subdir_3/STATUS.txt
The dirname will then return only the directory path, not the file name, which I suppose is what you want.
I'd consider using Perl or Python if you want to evolve this further, though, as it might get messier if you want to add priorities and sorting.

Taking up the accepted answer, it does not output a sorted and unique directory list. At the end of the "find" command, add:
| sort -u
or:
| sort | uniq
to get the unique list of the directories.
Credits go to Get unique list of all directories which contain a file whose name contains a string.
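Putting the pieces from the answers above together, a minimal sketch of the requested mycommand could look like this. The function name dirs_with_status and the optional second argument (the directory to start from) are illustrative choices, not something from the question.

```shell
# dirs_with_status KEYWORD [ROOT]
# Hypothetical helper: list, sorted and without duplicates, every directory
# under ROOT (default: .) whose STATUS.txt contains KEYWORD as a fixed string.
dirs_with_status() {
    keyword=$1
    root=${2:-.}
    find "$root" -type f -name STATUS.txt -exec grep -lF -- "$keyword" {} + |
        sed 's|/[^/]*$||' |   # strip the trailing /STATUS.txt
        sort -u
}
```

Called as dirs_with_status OPEN, it prints one directory per line. Directory names containing newlines are not handled by this sketch.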

IMHO you should write a Python script which:
Examines your directory structure and finds all files named STATUS.txt.
For each found file:
reads the file and executes mycommand depending on what the file contains.
If you want to extend the script later with sorting, you can find all the interesting files first, save them to a list, sort the list and execute the commands on the sorted list.
Hint: http://pythonadventures.wordpress.com/2011/03/26/traversing-a-directory-recursively/
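The same steps (find the files, read them, sort, then act) can also be sketched in shell rather than Python, which keeps it close to the other answers here. This is only an illustration under assumptions: the function name dirs_by_priority is made up, PRIORITY.txt is assumed to hold a single number on its first line, and a directory without PRIORITY.txt is treated as priority 0.

```shell
# dirs_by_priority KEYWORD [ROOT]
# Hypothetical helper: list directories whose STATUS.txt contains KEYWORD,
# ordered by the number on the first line of PRIORITY.txt (lowest first).
dirs_by_priority() {
    keyword=$1
    root=${2:-.}
    find "$root" -type f -name STATUS.txt -exec grep -lF -- "$keyword" {} + |
    while IFS= read -r status; do
        dir=${status%/*}
        prio=0   # assumption: a missing PRIORITY.txt means priority 0
        [ -f "$dir/PRIORITY.txt" ] && IFS= read -r prio < "$dir/PRIORITY.txt"
        printf '%s\t%s\n' "$prio" "$dir"
    done |
    sort -n |     # numeric sort on the leading priority column
    cut -f2-      # drop the priority, keep the directory
}
```

Use sort -rn instead of sort -n if a larger number should mean higher priority.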

find and copy all images in directory using terminal linux mint, trying to understand syntax

OS Linux Mint
Like the title says finally I would like to find and copy all images in a directory.
I found:
find all jpg (or JPG) files in a directory and copy them into the folder /home/joachim/neu2:
find . -iname \*.jpg -print0 | xargs -I{} -0 cp -v {} /home/joachim/neu2
and
find all image files in a direcotry:
find . -name '*' -exec file {} \; | grep -o -P '^.+: \w+ image'
My problem is first of all, I don't really understand the syntax. Could someone explain the code?
And secondly can someone connect the two codes for generating a code that does what I want ;)
Greetings and thanks in advance!

First, understand that the pipe "|" links commands: it feeds the output of the first command into the second as its standard input. Both of your command lines pipe the output of find into another command (xargs in the first, grep in the second). Let's look at those commands one after another:
First command: find
find is a program to "search for files in a directory hierarchy" (that is the explanation from find's man page). The syntax is (in this case)
find <search directory> <search pattern> <action>
In both cases the search directory is . (that is the current directory). Note that it does not just search the current directory but all its subdirectories as well (the directory hierarchy).
The search pattern accepts options -name (meaning it searches for files the name of which matches the pattern given as an argument to this option) or -iname (same as name but case insensitive) among others.
The action may be -print0 (print the full path of each match, i.e. the relative or absolute path to the file as seen from the search directory, terminated by a null character instead of a newline) or -exec (execute the given command on the file(s); the command must be ended with ";" and every instance of "{}" is replaced by the filename).
That is, the first shell code (first part, left of the pipe)
find . -iname \*.jpg -print0
searches all files with ending ".jpg" in the current directory hierarchy and prints their paths and names. The second one (first part)
find . -name '*' -exec file {} \;
finds all files in the current directory hierarchy and executes
file <filename>
on them. File is another command that determines and prints the file type (have a look at the man page for details, man file).
Second command: xargs
xargs is a command that "builds and executes command lines from standard input" (man xargs), i.e. from the find output that is piped into xargs. The command that it builds and executes is in this case
cp -v {} /home/joachim/neu2
Option -I{} defines the replacement string, i.e. every instance of {} in the command is replaced by the input it gets from find (that is, the filenames). Option -0 tells xargs that input items are terminated (separated) not by whitespace or newlines but only by a null character. It pairs with find's -print0 and is the standard way to pass find output to xargs, because it keeps filenames that contain spaces or newlines intact.
The command that is built and executed is then of course the copy command with option -v (verbose) and it copies each of the filenames it gets from find to the directory.
Third command: grep
grep filters its input, giving only those lines or strings that match a particular pattern. Option -o tells grep to print only the matching string, not the entire line (see man grep); -P tells it to interpret the following pattern as a Perl regexp. In Perl regex, ^ is the start of the line and .+ is any arbitrary string; this arbitrary string should then be followed by a colon, a space, a number of alphanumeric characters (in Perl regex denoted \w+), a space, and the string "image". Essentially this grep command filters the file output down to the lines for image files, each cut off after the word "image". (Read about Perl regexes for instance here: http://www.comp.leeds.ac.uk/Perl/matching.html )
The command you actually wanted
Now what you want to do is (1) take the output of the second shell command (which lists the image files), (2) bring it into the appropriate form and (3) pipe it into the xargs command from the first shell command line (which then builds and executes the copy command you wanted). So this time we have a three (actually four) stage shell command with three pipes. Not a problem. We already have stages (1) and (3) (though in stage (3) we need to leave out the -0 option because the input is not find -print0 output any more; we need xargs to treat newlines as item separators).
Stage (2) is still missing. I suggest using the cut command for this. cut splits each line into fields (separated by a delimiter character in the original string) and prints only the fields you select. I will choose ":" as the delimiter character (this ends the filename in the grep output, option -d':') and tell it to give us just the first field (option -f1, essentially: print only the filename, not the part that comes after the ":"), i.e. stage (2) would then be
cut -d':' -f1
And the entire command you wanted will then be:
find . -name '*' -exec file {} \; | grep -o -P '^.+: \w+ image' | cut -d':' -f1 | xargs -I{} cp -v {} /home/joachim/neu2
Note that you can find all the man pages for instance here: http://www.linuxmanpages.com
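The pipeline above splits lines on ":" and lets xargs -I{} parse the names, so it can trip over filenames that contain colons, quotes or newlines. As a hedged alternative, the sketch below does the type check and the copy inside one find invocation. The function name copy_images and its two arguments are made up for illustration, and "MIME type starts with image/" is an assumption about what should count as an image.

```shell
# copy_images SRC DEST
# Hypothetical helper: copy every regular file under SRC that file(1)
# identifies as an image into DEST, regardless of its extension.
copy_images() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    find "$src" -type f -exec sh -c '
        dest=$1; shift          # first argument is the target directory
        for f do
            case $(file -b --mime-type "$f") in
                image/*) cp -v "$f" "$dest"/ ;;
            esac
        done' sh "$dest" {} +
}
```

For the question's setup this would be called as copy_images . /home/joachim/neu2. Keep the destination outside the searched tree, otherwise find may visit the copies too; files with the same name in different subdirectories overwrite each other in the destination.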

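I figured out a command using only find and awk that does the job as well:
find . -name '*' -exec file {} \; |
awk '{
if ($3=="image"){
print substr($1, 1, length($1)-1);
system("cp " substr($1, 1, length($1)-1) " /home/joachim/neu2" )
}
}'
The substr($1, 1, length($1)-1) call is needed because the first column of file's output is the name followed by a colon; it strips that trailing colon. The start index is 1 because awk strings are 1-based, and awk implementations disagree on what a start of 0 returns. Because awk splits fields on whitespace and the cp command is not quoted, this only works for paths without spaces.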

The above answer is really good, but it can take a long time on a huge directory, since it runs file on every entry.
Here is a shorter version for when you already know the file extension:
find . -name \*.jpg | xargs -I{} cp --parents -v {} ~/testimage/

Here's another one which works like a charm.
It adds the EPOCH time to prevent overwriting files with the same name.
cd /media/myhome/'Local station'/
find . -path ./jpg -prune -o -type f -iname '*.jpg' -exec sh -c '
for file do
newname="${file##*/}"
newname="${newname%.jpg}"
mv -T -- "$file" "/media/myhome/Local station/jpg/$newname-$(date +%s).jpg"
done
' find-sh {} +
cd ~/
It's been designed by Kamil in this post here.

Find a specific type file from a directory:
find /home/user/find/data/ -name '*' -exec file {} \; | grep -o -P '^.+: \w+ image'
Copy specific type of file from one directory to another directory:
find /home/user/find/data/ -name '*' -exec file {} \; | grep -o -P '^.+: \w+ image' | cut -d':' -f1 | xargs -I{} cp -v {} /home/user/copy/data/

Linux search text string from .bz2 files recursively in subdirectories

I have a case where multiple .bz2 files are located in subdirectories, and I want to search all of them for a text string using the Linux commands bzcat and grep.
I am able to search one file at a time by using the following command:
bzcat <filename.bz2> | grep -ia 'text string' | less
But now I need to do the above for all files in the subdirectories.

You can use bzgrep instead of bzcat and grep. This is faster.
To grep recursively in a directory tree use find:
find -type f -name '*.bz2' -execdir bzgrep "pattern" {} \;
find is searching recursively for all files with the *.bz2 extension and applies the command specified with -execdir to them.

There are several methods:
bzgrep regexp $(find -name \*.bz2)
This method works if the number of files found is not very large (and their paths contain no special characters). Otherwise you had better use this one:
find -name \*.bz2 -exec bzgrep regexp {} /dev/null \;
Please note /dev/null in the second method. You use it to make bzgrep print the filename,
where the regexp was found.
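If you would rather keep the bzcat and grep -ia pipeline from the question, a minimal sketch of a recursive wrapper could look like this. The function name bz_search is made up for illustration; each matching line is prefixed with the path of the archive it came from, similar to what the /dev/null trick achieves with bzgrep.

```shell
# bz_search PATTERN [ROOT]
# Hypothetical helper: search every *.bz2 file under ROOT (default: .)
# for PATTERN, case-insensitively, printing "path:matching line".
bz_search() {
    pattern=$1
    find "${2:-.}" -type f -name '*.bz2' -exec sh -c '
        pattern=$1; shift
        for f do
            bzcat "$f" | grep -ia -- "$pattern" |
                while IFS= read -r line; do
                    printf "%s:%s\n" "$f" "$line"
                done
        done' sh "$pattern" {} +
}
```

For interactive use the result can be paged as before, e.g. bz_search 'text string' | less.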

Just try to use:
bzgrep --help
grep through bzip2 files
Usage: bzgrep [grep_options] pattern [files]
For example, I need to grep for the number 1941974 in this list of files:
'billing_log_1.bz'
'billing_log_2.bz'
'billing_log_3.bz'
'billing_log_4.bz'
'billing_log_5.bz'
What can I do?
bzgrep '1941974' billing_log_1.bz

Continuing from your code, with fixes, using bzcat:
find . -type f -name "*.bz2" | while IFS= read -r file
do
bzcat "$file" | grep -ia 'text string' | less
done

Find multiple files and rename them in Linux

I have files like a_dbg.txt, b_dbg.txt ... on a SUSE 10 system. I want to write a bash shell script that renames these files by removing "_dbg" from their names.
Google suggested using the rename command, so I executed rename _dbg.txt .txt *dbg* in CURRENT_FOLDER.
My actual CURRENT_FOLDER contains the below files.
CURRENT_FOLDER/a_dbg.txt
CURRENT_FOLDER/b_dbg.txt
CURRENT_FOLDER/XX/c_dbg.txt
CURRENT_FOLDER/YY/d_dbg.txt
After executing the rename command,
CURRENT_FOLDER/a.txt
CURRENT_FOLDER/b.txt
CURRENT_FOLDER/XX/c_dbg.txt
CURRENT_FOLDER/YY/d_dbg.txt
It does not work recursively. How can I make this command rename files in all subdirectories? Besides XX and YY there will be many subdirectories whose names are unpredictable, and CURRENT_FOLDER will contain some other files as well.

You can use find to find all matching files recursively:
find . -iname "*dbg*" -exec rename _dbg.txt .txt '{}' \;
EDIT: what are '{}' and \;?
The -exec argument makes find execute rename for every matching file found. '{}' will be replaced with the path name of the file. The last token, \; is there only to mark the end of the exec expression.
All that is described nicely in the man page for find:
-exec utility [argument ...] ;
True if the program named utility returns a zero value as its
exit status. Optional arguments may be passed to the utility.
The expression must be terminated by a semicolon (``;''). If you
invoke find from a shell you may need to quote the semicolon if
the shell would otherwise treat it as a control operator. If the
string ``{}'' appears anywhere in the utility name or the argu-
ments it is replaced by the pathname of the current file.
Utility will be executed from the directory from which find was
executed. Utility and arguments are not subject to the further
expansion of shell patterns and constructs.
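Because there are two incompatible rename tools (the util-linux one used above and the Perl one), here is a minimal sketch that avoids rename altogether and uses only find, sh and mv. The function name strip_dbg is made up for illustration.

```shell
# strip_dbg [ROOT]
# Hypothetical helper: rename every regular file NAME_dbg.txt under ROOT
# (default: .) to NAME.txt, in all subdirectories.
strip_dbg() {
    find "${1:-.}" -type f -name '*_dbg.txt' -exec sh -c '
        for f do
            # ${f%_dbg.txt} removes the suffix; then the plain .txt is re-added
            mv -- "$f" "${f%_dbg.txt}.txt"
        done' sh {} +
}
```

Run in CURRENT_FOLDER, strip_dbg turns XX/c_dbg.txt into XX/c.txt and leaves files without the _dbg.txt suffix alone. mv overwrites an existing target silently; use mv -i if that matters.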

For renaming recursively I use the following commands:
find -iname \*.* | rename -v "s/ /-/g"

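A small script I wrote to rename all files with a .txt extension to a .cpp extension under /tmp and its subdirectories, recursively:
#!/bin/bash
# Read NUL-delimited paths so names with spaces or newlines survive.
find /tmp -name '*.txt' -print0 | while IFS= read -r -d '' file
do
    # ${file%.txt} strips only the trailing .txt, not other occurrences.
    mv -- "$file" "${file%.txt}.cpp"
done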

with bash:
shopt -s globstar nullglob
rename _dbg.txt .txt **/*dbg*

find -execdir rename also works for non-suffix replacements on basenames
https://stackoverflow.com/a/16541670/895245 works directly only for suffixes, but this will work for arbitrary regex replacements on basenames:
PATH=/usr/bin find . -depth -execdir rename 's/_dbg\.txt$/.txt/' '{}' \;
or to affect files only:
PATH=/usr/bin find . -type f -execdir rename 's/_dbg\.txt$/.txt/' '{}' \;
-execdir first cds into the directory before executing only on the basename.
Tested on Ubuntu 20.04, find 4.7.0, rename 1.10.
Convenient and safer helper for it
find-rename-regex() (
set -eu
find_and_replace="$1"
PATH="$(echo "$PATH" | sed -E 's/(^|:)[^\/][^:]*//g')" \
find . -depth -execdir rename "${2:--n}" "s/${find_and_replace}" '{}' \;
)
GitHub upstream.
Sample usage to replace spaces ' ' with hyphens '-'.
Dry run that shows what would be renamed to what without actually doing it:
find-rename-regex ' /-/g'
Do the replace:
find-rename-regex ' /-/g' -v
Command explanation
The awesome -execdir option does a cd into the directory before executing the rename command, unlike -exec.
-depth ensures that the renaming happens first on children, and then on parents, to prevent potential problems with missing parent directories.
-execdir is required because rename does not play well with non-basename input paths, e.g. the following fails:
rename 's/findme/replaceme/g' acc/acc
The PATH hacking is required because -execdir has one very annoying drawback: find is extremely opinionated and refuses to do anything with -execdir if you have any relative paths in your PATH environment variable, e.g. ./node_modules/.bin, failing with:
find: The relative path ‘./node_modules/.bin’ is included in the PATH environment variable, which is insecure in combination with the -execdir action of find. Please remove that entry from $PATH
See also: https://askubuntu.com/questions/621132/why-using-the-execdir-action-is-insecure-for-directory-which-is-in-the-path/1109378#1109378
-execdir is a GNU find extension to POSIX. rename is Perl based and comes from the rename package.
Rename lookahead workaround
If your input paths don't come from find, or if you've had enough of the relative path annoyance, we can use some Perl lookahead to safely rename directories as in:
git ls-files | sort -r | xargs rename 's/findme(?!.*\/)\/?$/replaceme/g' '{}'
I haven't found a convenient analogue for -execdir with xargs: https://superuser.com/questions/893890/xargs-change-working-directory-to-file-path-before-executing/915686
The sort -r is required so that files are processed before their respective directories, since in reverse order longer paths come before shorter ones with the same prefix.
Tested in Ubuntu 18.10.

Script above can be written in one line:
find /tmp -name "*.txt" -exec bash -c 'mv -- "$0" "${0%.txt}.cpp"' '{}' \;

If you just want to rename and don't mind using an external tool, then you can use rnm. The command would be:
#on current folder
rnm -dp -1 -fo -ssf '_dbg' -rs '/_dbg//' *
-dp -1 will make it recursive to all subdirectories.
-fo implies file only mode.
-ssf '_dbg' searches for files with _dbg in the filename.
-rs '/_dbg//' replaces _dbg with empty string.
You can run the above command with the path of the CURRENT_FOLDER too:
rnm -dp -1 -fo -ssf '_dbg' -rs '/_dbg//' /path/to/the/directory

You can use this below.
rename --no-act 's/\.html$/\.php/' *.html */*.html

This command worked for me. Remember first to install the perl rename package:
find -iname \*.* | grep oldname | rename -v "s/oldname/newname/g"

To expand on the excellent answer by @CiroSantilliПутлерКапут六四事: do not make find match files that we don't have to rename.
I have found this to improve performance significantly on Cygwin.
Please feel free to correct my ineffective bash coding.
FIND_STRING="ZZZZ"
REPLACE_STRING="YYYY"
FIND_PARAMS="-type d"
find-rename-regex() (
set -eu
find_and_replace="${1}/${2}/g"
echo "${find_and_replace}"
find_params="${3}"
mode="${4}"
if [ "${mode}" = 'real' ]; then
PATH="$(echo "$PATH" | sed -E 's/(^|:)[^\/][^:]*//g')" \
find . -depth -name "*${1}*" ${find_params} -execdir rename -v "s/${find_and_replace}" '{}' \;
elif [ "${mode}" = 'dryrun' ]; then
echo "${mode}"
PATH="$(echo "$PATH" | sed -E 's/(^|:)[^\/][^:]*//g')" \
find . -depth -name "*${1}*" ${find_params} -execdir rename -n "s/${find_and_replace}" '{}' \;
fi
)
find-rename-regex "${FIND_STRING}" "${REPLACE_STRING}" "${FIND_PARAMS}" "dryrun"
# find-rename-regex "${FIND_STRING}" "${REPLACE_STRING}" "${FIND_PARAMS}" "real"

In case anyone is comfortable with fd and rnr, the command is:
fd -t f -x rnr '_dbg.txt' '.txt'
rnr only command is:
rnr -f -r '_dbg.txt' '.txt' *
rnr has the benefit of being able to undo the command.

On Ubuntu (after installing rename), this simpler solution worked the best for me. This replaces space with underscore, but can be modified as needed.
find . -depth | rename -d -v -n "s/ /_/g"
The -depth flag is telling find to traverse the depth of a directory first, which is good because I want to rename the leaf nodes first.
The -d flag on rename tells it to only rename the filename component of the path. I don't know how general the behavior is but on my installation (Ubuntu 20.04), it could be the file or the directory as long as it is the leaf node of the path.
I recommend the -n (no action) flag first along with -v, so you can see what would get renamed and how.
Using the two flags together, it renames all the files in a directory first and then the directory itself. Working backwards. Which is exactly what I needed.
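For systems where the Perl rename (and its -d flag) is not installed, the same children-first idea can be sketched with find -depth and mv alone. The function name spaces_to_underscores is made up for illustration, and -mindepth is a GNU/BSD find option rather than POSIX; it is used here so the starting directory itself is never renamed.

```shell
# spaces_to_underscores [ROOT]
# Hypothetical helper: replace spaces with underscores in the names of all
# files and directories below ROOT (default: .), deepest entries first.
spaces_to_underscores() {
    find "${1:-.}" -mindepth 1 -depth -name '* *' -exec sh -c '
        for p do
            dir=${p%/*}      # everything before the last slash
            base=${p##*/}    # only the last path component is changed
            new=$(printf "%s" "$base" | tr " " "_")
            mv -- "$p" "$dir/$new"
        done' sh {} +
}
```

Because -depth lists a directory only after everything inside it, each path is still valid at the moment it is renamed. To preview the effect first, temporarily replace mv with echo mv.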

classic solution (assumes no whitespace in file names):
for f in $(find . -name "*dbg*"); do mv "$f" "$(echo "$f" | sed 's/_dbg//')"; done

How to list specific type of files in recursive directories in shell?

How can we find files of specific types, e.g. .doc and .pdf files, in nested directories?
The command I tried:
$ ls -R | grep .doc
but if there is a file name like alok.doc.txt the command will display that too which is obviously not what I want. What command should I use instead?

If you are more comfortable with "ls" and "grep", you can do what you want using a regular expression in the grep command (the trailing '$' character indicates that .doc must be at the end of the line, which excludes "file.doc.txt"):
ls -R |grep "\.doc$"
More information about using grep with regular expressions is in the grep man page.

The output of ls is mainly intended for reading by humans. For advanced queries and automated processing, you should use the more powerful find command:
find /path -type f \( -iname "*.doc" -o -iname "*.pdf" \)
Or, if you have bash 4.0 or later:
#!/bin/bash
shopt -s globstar
shopt -s nullglob
for file in **/*.{pdf,doc}
do
echo "$file"
done

find . | grep "\.doc$"
This will show the path as well.

Some of the other methods that can be used:
echo *.{pdf,docx,jpeg}
stat -c %n * | grep 'pdf\|docx\|jpeg'

We had a similar question. We wanted a list - with paths - of all the config files in the etc directory. This worked:
find /etc -type f \( -iname "*.conf" \)
It gives a nice list of all the .conf file with their path. Output looks like:
/etc/conf/server.conf
But, we wanted to DO something with ALL those files, like grep those files to find a word, or setting, in all the files. So we use
find /etc -type f \( -iname "*.conf" \) -print0 | xargs -0 grep -Hi "ServerName"
to find via grep ALL the config files in /etc that contain a setting like "ServerName" Output looks like:
/etc/conf/server.conf: ServerName "default-118_11_170_172"
Hope you find it useful.
Sid

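Similarly, if you prefer the wildcard character * (not quite like the regex suggestions), you can use ls with the -l flag (long format, one file per line) and the -R flag like you had, and name the files you want with *.doc. Note that the shell expands *.doc before ls runs, so this only lists matching names in the current directory; -R then recurses only into directories whose own names end in .doc, and files such as sub/report.doc are not found.
I.e. either
ls -l -R *.doc
or, if you want it to list the files on fewer lines,
ls -R *.doc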

If you have files with extensions that don't match the file type, you could use the file utility.
find "$PWD" -type f -exec file -N {} \; | grep "PDF document" | awk -F: '{print $1}'
Instead of $PWD you can use the directory you want to start the search in. file even prints the PDF version.
