Get distinct extension list Linux - linux

I am new to Linux and I am currently facing a problem. I want to get the list of extensions (.doc, .pdf) used in a folder. I googled a lot and finally found a solution, which is given below:
find . -type f | awk -F. '!a[$NF]++{print $NF}'
I understand find . -type f, but I am unable to understand awk -F. '!a[$NF]++{print $NF}'. What does it mean?
NF = Number of Fields in the current record
Can anyone explain?
Thanks in advance.

To answer your question about what the awk line is doing:
As you already indicated, the line find . -type f returns a list of the files located under the current directory, e.g.
./foo.ext1
./bar.ext2
./spam.ext2
./ham.ext3
./spam.ham.eggs
This list of files is sent through a pipe to the command awk -F. '!a[$NF]++{print $NF}'. This awk line packs in a lot of information. First of all, you need to know that awk is a record parser where each record consists of a number of fields. The default record is a line, while the default field separator is a sequence of spaces. So what does your awk line do?
-F. :: this redefines the field separator to be a dot (.). From this point on, the first line of the example has 3 fields (an empty field before the leading dot, /foo and ext1), while the last line has 4 fields (an empty field, /spam, ham and eggs). Whatever the count, the extension always ends up in the last field.
NF :: this is an awk variable that returns the number of fields per record. It is clear that the extension is represented by the last field ($NF)
a[$NF] :: this is an array indexed by the extension. An array element you have never assigned to behaves as zero in a numeric context.
a[$NF]++ :: this returns the current value of a[$NF] and then increments it by 1. Thus for line 1, a["ext1"]++ returns 0 and sets a["ext1"] to 1, while for line 3, a["ext2"]++ returns 1 and sets a["ext2"] to 2. In other words, a[$NF] keeps track of the number of times $NF has appeared.
!a[$NF]++ :: this combines the logic above and checks whether the return value of a[$NF]++ is 0. If it is 0 the expression is true, otherwise it is false. For line 2 of the example, the expression is true because a["ext2"] still has the value 0; after the statement, a["ext2"] has the value 1. When line 3 is read, the expression is false. In other words: have we already seen $NF? And while answering that question with "yes" or "no", increment the count for $NF by one.
!a[$NF]++{print $NF} :: this combines everything. It essentially states: if !a[$NF]++ evaluates to true, then print $NF (evaluating the condition already incremented a[$NF] by one). In other words, if the field representing the extension ($NF) appears for the first time, print that field; if it has already appeared before, do nothing.
The incrementing of the array is important as it keeps track of what has been seen already. So line by line the following will happen
foo.ext1 => $NF="ext1", a["ext1"] is 0 so print $NF and set a["ext1"]=1
bar.ext2 => $NF="ext2", a["ext2"] is 0 so print $NF and set a["ext2"]=1
spam.ext2 => $NF="ext2", a["ext2"] is 1 so do not print and set a["ext2"]=2
ham.ext3 => $NF="ext3", a["ext3"] is 0 so print $NF and set a["ext3"]=1
spam.ham.eggs => $NF="eggs", a["eggs"] is 0 so print $NF and set a["eggs"]=1
The output is
ext1
ext2
ext3
eggs
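If the compact !a[$NF]++ idiom still feels opaque, here is a more verbose awk program that does exactly the same thing. This is only an illustrative sketch; the one-liner above is the idiomatic form:
find . -type f | awk -F. '
{
    ext = $NF                 # the last field is taken as the extension
    if (seen[ext] == 0) {     # is this the first time we meet this extension?
        print ext             # then print it
    }
    seen[ext]++               # remember that we have seen it
}'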
General comments:
A file without any extension, whether or not it sits in a hidden directory (e.g. ./path/to/awesome_filename_without_extension or ./path/to/.secret/filename_without_extension), gets part of its full path printed as if it were the extension. The result, however, is meaningless, e.g.
/path/to/awesome_filename_without_extension
secret/filename_without_extension
This is best resolved as
find . -type f -exec basename -a '{}' + \
| awk -F. '((NF>1)&&(!a[$NF]++)){print $NF}'
Here the output of find is processed directly by basename, which strips the directory part from the filename. The awk line does one more check: do we have more than 1 field (i.e. is there an extension at all)?
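Purely to illustrate the same "print only on first occurrence" idea outside awk, here is a rough bash-only equivalent. This is a sketch, assuming bash 4+ (for associative arrays) and filenames without embedded newlines; as above, basename strips the directory part first:
declare -A seen
while IFS= read -r name; do
    ext=${name##*.}                    # strip everything up to the last dot
    [ "$ext" = "$name" ] && continue   # no dot at all -> no extension
    if [ -z "${seen[$ext]}" ]; then    # first time we see this extension?
        seen[$ext]=1
        printf '%s\n' "$ext"
    fi
done < <(find . -type f -exec basename -a '{}' +)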

A very simple way of doing what you are attempting is to sort the output keeping only unique extensions, e.g.
find . -type f -regex ".*[.][a-zA-Z0-9][a-zA-Z0-9]*$" | \
awk -F '.' '{ print $NF }' | sort -u
If your sort doesn't support the -u option, you can pipe the output of sort to uniq, e.g.
find . -type f -regex ".*[.][a-zA-Z0-9][a-zA-Z0-9]*$" | \
awk -F '.' '{ print $NF }' | sort | uniq
The -regex option limits the find selection to filenames ending in a dot followed by at least one alphanumeric character. However, it will also pick up names where the part after the last '.' is not really an extension, e.g. foo.bar.fatcat would result in fatcat being included in the list.
You could adjust the regular expression to meet your needs. If your version of find supports posix-extended regular expressions then you can prevent longer extensions from being picked up. For example to limit the extension to 1-3 characters, you could use:
find . -type f -regextype posix-extended -regex ".*[.][a-zA-Z0-9]{1,3}$" | \
awk -F '.' '{ print $NF }' | sort -u
There are other ways to approach this, but given your initial example, this is a close follow-on.
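For instance, if you also want to see how many files use each extension, uniq -c adds a count column to the same pipeline (an optional variant, not something the question strictly asks for):
find . -type f -regex ".*[.][a-zA-Z0-9][a-zA-Z0-9]*$" | \
awk -F '.' '{ print $NF }' | sort | uniq -c | sort -rn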

You can use the following command for this purpose:
$ find <DIR> -type f -print0 | xargs -0 -n1 basename | grep -Po '(?<=.)\..*$' | sort | uniq
.bak
.c
.file
.file.bak
.input
.input.bak
.log
.log.bak
.out
.out.bak
.test
.test.bak
.txt
.txt.bak
where the find command looks for all files under the <DIR> subtree and passes them to basename to get only the filename without the path part (-print0 and -0 are used to cope with files that have spaces in their names). Then grep keeps only the part of the string that starts with a . (the extension, e.g. .tar, .txt, .tar.gz); it also ignores hidden files whose names start with a dot. After that you sort the results and keep only the unique values.
If you do not need the leading . in the extension name, append
| sed 's/^\.//'
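Putting the pieces together, the full pipeline without the leading dot would look like this (the same commands as above, just combined; <DIR> is still a placeholder for your directory):
find <DIR> -type f -print0 | xargs -0 -n1 basename | grep -Po '(?<=.)\..*$' | sed 's/^\.//' | sort | uniq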

Related

Searching multiple files for list of words in a text file

I need to go through a huge number of text files and list the ones that contain ALL of the words listed in another text file.
I need to list only the files containing all of the words. It does not have to be in any specific order. I've tried a variety of grep commands, but they only output the files containing any of the words, not all of them. It would be ideal to use the txt file containing the list of words as the search input for grep.
Expected output is a list of just the files that succeed in the search (files that contain all the words from the "query" text file).
Tried
grep -Ffw word_list.txt /*.fas
find . -exec grep "word_list.txt" '{}' \; -print
I've found solutions using a number of pipes like
awk "/word1/&&/word2/&&/word3/" ./*.txt
find . -path '*.txt' -prune -o -type f -exec gawk '/word1/{a=1}/word2/{b=1}/word3/{c=1}END{ if (a && b && c) print FILENAME }' {} \;
But I have a huge list of words, so that would be impractical.
Thank you.
Given sample files
file1.txt
word1
word2
word4
word5
file2.txt
word1
word2
word3
word4
file3.txt
word2
word3
word4
file4.txt
word0
word1
word2
word3
word4
file5.txt
word0
word1
word2
word3
word4
word5
This old-fashioned awk/shell code
#!/bin/bash
wordList="$1"
shift

awk -v wdListFile="$wordList" '
BEGIN {
    dbg=0
    while ((getline < wdListFile) > 0) {
        words[$0]=$0
        flags[$0]=0
        numFlags++
    }
}
{
    if (dbg) { print "#dbg: myFile=" myFile " FILENAME=" FILENAME }

    if (myFile != FILENAME) {
        # a minor cost of an extra reset on the first iteration of the run
        if (dbg) { print "#dbg: inside flags reset" }
        for (flg in flags) {
            flags[flg]=0
        }
    }

    for (i=1; i<=NF; i++) {
        if (dbg) { print "#dbg: $i=" $i }
        if ($i in words) {
            flags[$i]++
        }
    }

    matchedCnt=0
    for (f in flags) {
        if (dbg) { print "#dbg: flags[" f "]=" flags[f] }
        if (flags[f] > 0) {
            matchedCnt++
            if (dbg) { print "#dbg: incremented matchedCnt to " matchedCnt }
        }
    }

    if (dbg) { print "#dbg: Testing matchedCnt=" matchedCnt " == numFlags=" numFlags }
    if (matchedCnt == numFlags) {
        if (dbg) { print "All words found in " FILENAME " matchedCnt=" matchedCnt " numFlags=" numFlags }
        print FILENAME
        nextfile
    }

    myFile=FILENAME
    if (dbg) { print "#dbg: myFile NOW=" myFile }
}' "$@"
Run from the command line as
./genGrep.sh wd.lst file*.txt
Produces the following output
file2.txt
file4.txt
file5.txt
One time only, make the script executable with
chmod 755 ./genGrep.sh
I would recommend making a copy of this file with dbg in the name, then take the original copy and delete all lines with dbg. This way you'll have a dbg version if you need it, but the dbg lines add an extra ~20% to reading the code.
Note that you can switch all debugging on by setting dbg=1, OR you can turn on individual lines by adding a ! character, i.e. if (! dbg) { ... }.
If for some reason you're running on really old Unix hardware, the nextfile command may not work. See if your system has gawk available, or get it installed.
I think there is a trick to getting nextfile behavior if it's not built in, but I don't want to spend time researching that now.
Note that the use of the flags[] array, matchedCnt variable and the builtin awk function nextfile is designed to stop searching in a file once all words have been found.
You could also add a parameter to say "if n percent match, then print file name", but that comes with a consulting rate attached.
If you don't understand the stripped-down awk code (with the dbg sections removed), please work your way through Grymoire's Awk Tutorial before asking questions.
Managing thousands of files (as you indicate) is a separate problem. But to get things going, I would call genGrep.sh wd.lst A* ; genGrep.sh wd.lst B* ; ... and hope that works. The problem is that the command line has a limit on how many characters it can process at once in filename lists. So if A* expands to a billion characters, you have to find a way to break the list up into something the shell can process.
Typically, this is solved with xargs, so
find /path/to/files -name 'file*.txt' | xargs -I {} ./genGrep.sh wd.lst {}
will find all the files matching the wildcard under the one or more /path/to/files directories that you list as the first argument(s) to find.
All matching files are sent through the pipe to xargs, which invokes ./genGrep.sh for the file names it reads from the pipe and keeps looping (invisibly to you) until all files have been processed.
There are extra options to xargs that allow running multiple copies of ./genGrep.sh at the same time, if you have extra "cores" available on your computer. I don't want to get too deep into that, as I don't know if the rest of this is really going to work in your real-world use.
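For what it's worth, here is a minimal sketch of that parallel variant, assuming GNU find and xargs; -P 4 runs up to four copies of the script at once and -n 20 passes 20 files per invocation, both numbers being arbitrary:
find /path/to/files -name 'file*.txt' -print0 \
    | xargs -0 -n 20 -P 4 ./genGrep.sh wd.lst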
IHTH
It's a little hack, as there is no direct way to do AND in grep. We can chain greps to simulate AND; note that each grep after the first has to read the surviving file list from the pipe (via xargs) rather than being handed *.txt again, otherwise the stages would not filter each other:
grep -l -E "word1" *.txt | xargs grep -l -E "word2" | xargs grep -l -E "word3" | xargs grep -l -E "word4"
-l => --files-with-matches, print only the names of the files that match
-E => --extended-regexp
xargs => feeds the file names that survived the previous stage to the next grep.
Try something like:
WORD_LIST=file_with_words.txt
FILES_LIST=file_with_files_to_search.txt
RESULT=file_with_files_containing_all_words.txt

# Generate a list of files to search and store it as the provisional result
# You can use find, ls, or any other way you find useful
find . -type f > ${RESULT}

# Now perform the search for every word
for WORD in $(<${WORD_LIST}); do
    # Remove any previous file list
    rm -f ${FILES_LIST}
    # Set the provisional result as the new starting point
    mv ${RESULT} ${FILES_LIST}
    # Do a grep on this file list and keep only the files that
    # contain this particular word (and all the previous ones)
    cat ${FILES_LIST} | xargs grep -l "${WORD}" > ${RESULT}
done

# Clean up temporary files
rm -f ${FILES_LIST}
At this point you should have in ${RESULT} the list of files that contain all the words in ${WORD_LIST}.
This operation is costly, as you have to read all the (still) candidate files again and again for every word you check, so try to put the least frequent words first in ${WORD_LIST}; that way you drop as many files as possible from the check as early as possible.
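If you have no idea which words are rare, here is a rough (and itself not cheap) way to pre-sort the word list by how many files contain each word, rarest first. It is only a sketch, assuming GNU grep and one word per line in the list:
while IFS= read -r WORD; do
    printf '%d %s\n' "$(grep -rlF -- "$WORD" . | wc -l)" "$WORD"
done < "${WORD_LIST}" | sort -n | cut -d' ' -f2- > sorted_word_list.txt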

Listing most recent files whose total sizes are about a certain value

I would like to copy the most recent files in a directory to another directory, such that the total size of the copied files is about 10 GB, let's say.
I know that I can list the most recent 10 files with a certain amount of size like this:
find . -maxdepth 1 -type f -size +100M -print0 | xargs -0 ls -Shal | head
But is there any way to find the most recent files whose total size is about 10 GB?
Thanks
Yes, that is possible. find has a -printf action that allows you to output only the information you are interested in. In your case, that would be a timestamp (e.g. last modification time), the file size, and the name of the file. You can then sort the output according to the timestamp, and use awk to sum the file sizes and output the file names up to a certain limit:
find "$some_directory" -printf "%T# %s %p\n" | sort -nr \
| awk '{ a = a + $2; if (a > 10000) { print a; exit; }; print $3; }'
Adjust the limit according to your needs, and remove print a if you are not interested in the result. If you want to include the file that pushes the sum over the limit, replace print a with print $3.
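To go from printing the names to actually copying the files (which is what the question asks for), a minimal sketch along the same lines could look like this. Here /path/to/destination is a placeholder, the 10 GB limit is expressed in bytes, and filenames are assumed to contain no spaces or newlines:
dest=/path/to/destination              # assumption: your target directory
limit=$((10 * 1024 * 1024 * 1024))     # 10 GB in bytes

find "$some_directory" -type f -printf "%T@ %s %p\n" | sort -nr \
    | awk -v limit="$limit" '{ total += $2; if (total > limit) exit; print $3 }' \
    | xargs -I{} cp -p {} "$dest"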

find a pattern and print line based on finding the first pattern sed, awk grep

I have a rather large file. What is common to all sections is the HOSTNAME line that starts each one, for example:
HOSTNAME:host1
data 1
data here
data 2
text here
section 1
text here
part 4
data here
comm = 2
HOSTNAME:host-2
data 1
data here
data 2
text here
section 1
text here
part 4
data here
comm = 1
As you see above, within each section there are sub-sections broken down by keywords or lines that have specific values.
I would like to use a one-liner to print the hostname for each section and then print whichever lines I want to extract under that hostname section.
Can you please help? Right now I am using grep -C 10 HOSTNAME | grep -C pattern,
but this assumes that there are 10 lines in each section. This is not an optimal way to do it; can someone show a better way? I also need to be able to print more than one line under each pattern that I find. So if I find data 1 and there are additional lines under it, I would like to grab and print them too.
So for a command like one of these:
grep -C 10 HOSTNAME | grep "data 1"
grep -C 10 HOSTNAME | grep -A 2 "data 1"
the desired output would be:
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
Besides grep, I use this sed command to print my output:
sed -r '/HOSTNAME|shared/!d' filename
The only problem with this sed command is that it only prints the lines that contain the patterns shared and HOSTNAME. I also need to be able to specify the number of lines to print, in my case under the line that matched the pattern shared. So I would like to print HOSTNAME and then give the number of lines to print under the second search pattern, shared.
Thanks
awk to the rescue!
$ awk -v lines=2 '/HOSTNAME/{c=lines} NF&&c&&c--' file
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
This prints lines lines (the number given with -v lines=2) starting from the pattern match, and skips empty lines.
If you want to specify a secondary keyword instead of a number of lines:
$ awk -v key='data 1' '/HOSTNAME/{h=1; print} h&&$0~key{print; h=0}' file
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
Here is a sed twoliner:
sed -n -r '/HOSTNAME/ { p }
/^\s+data 1/ {p }' hostnames.txt
It prints (p)
when the line contains a HOSTNAME
when the line starts with some whitespace (\s+) followed by your search criterion (data 1)
non-matching lines are not printed (due to the sed -n option)
Edit: Some remarks:
this was tested with GNU sed 4.2.2 under linux
you don't need the -r if your sed version does not support it; just change the second pattern to /^.*data 1/
we can squash everything into one line with ;
Putting it all together, here is a revised version in one line, without the need for the extended regex ( i.e without -r):
sed -n '/HOSTNAME/ { p } ; /^.*data 1/ {p }' hostnames.txt
The OP's requirements seem rather unclear, but the following is consistent with one interpretation of what has been requested; more importantly, the program has no special requirements, and the code can easily be modified to meet a variety of needs. In particular, both search patterns (the HOSTNAME pattern and the "data 1" pattern) can easily be parameterized.
The main idea is to print all lines in a specified subsection, or at least a certain number up to some limit.
If there is a limit on how many lines in a subsection should be printed, specify a value for limit, otherwise set it to 0.
awk -v limit=0 '
/^HOSTNAME:/ { subheader=0; hostname=1; print; next}
/^ *data 1/ { subheader=1; print; next }
/^ *data / { subheader=0; next }
subheader && (limit==0 || (subheader++ < limit)) { print }'
Given the lines provided in the question, the output would be:
HOSTNAME:host1
data 1
HOSTNAME:host-2
data 1
(Yes, I know the variable 'hostname' in the awk program is currently unused, but I included it to make it easy to add a test to satisfy certain obvious requirements regarding the preconditions for identifying a subheader.)
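As a sketch of the parameterization mentioned above (the awk variables host and key are illustrative names, and file stands for your input file):
awk -v limit=0 -v host='^HOSTNAME:' -v key='^ *data 1' '
$0 ~ host  { subheader=0; hostname=1; print; next }
$0 ~ key   { subheader=1; print; next }
/^ *data / { subheader=0; next }
subheader && (limit==0 || (subheader++ < limit)) { print }' file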
sed -n -e '/HOSTNAME/,+1p' -e '/Duplex/,+1p'
The simplest way to do it is to combine two sed range commands like this (GNU sed; the +1 after each address prints one extra line after the matching one).

Remove a null character (Shell Script)

I've looked everywhere and I'm out of luck.
I am trying to count the files in my current directory and all sub directories so that when I run the shell script count_files.sh it will produce a similar output to:
$
2 sh
4 html
1 css
2 noexts
(EDIT the above output should have each count and extension on a newline)
$
where noexts are either files without any period as an extension (ex: fileName ) or files with a period but no extension (ex: fileName. ).
this pipeline:
find * | awk -F . '{print $NF}'
gives me a comprehensive list of all the files, and I've figured out how to remove files without any period (ex: fileName ) using sed '/\//d'
MY ISSUE is that I cannot remove the files from the output of the above pipeline that are separated by a period but have NULL after the period (ex: fileName. ), as it is separated by the delimiter '.'
How can I use sed like above to remove a null character from a pipe input?
I understand this could be a quick fix, but I've been googling like a madman with no luck. Thanks in advance.
Chip
To filter out filenames that end with a . (since the filename is the whole input line in find's output), you could use
sed '/\.$/d'
Where \. matches a literal dot and $ matches the end of the line.
However, I think I'd do the whole thing in awk, since sorting does not appear to be necessary:
EDIT: Found a nicer way to do it with awk and find's -printf action.
find . -type f -printf '%f\n' | awk -F. '!/\./ || $NF == "" { ++count["noext"]; next } { ++count[$NF] } END { for(k in count) { print count[k] " " k } }'
Here we pass -printf '%f\n' to find to make it print only the file name without the preceding directory, which makes it much easier to work with for our purposes -- this way there's no need to worry about periods in directory names (such as /etc/somethingorother.d). The field separator is '.', the awk code is
!/\./ || $NF == "" {      # if the line (the filename) does not contain
                          # a period, or there's nothing after the last .
    ++count["noext"]      # increment the "noext" counter
                          # note that this will be collated with files that
                          # have ".noext" as filename extension. see below.
    next                  # go to the next line
}
{                         # in all other lines
    ++count[$NF]          # increment the counter for the file extension
}
END {                     # in the very end:
    for(k in count) {     # print the counters.
        print count[k] " " k
    }
}
Note that this way, if there is a file "foo.noext", it will be counted among the files without a filename extension. If this is a worry, use a special counter for files without an extension -- either apart from the array or with a key that cannot be a filename extension (such as one that includes a . or the empty string).
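If you would rather see the busiest extensions first, the same command can simply be piped through sort (an optional touch, not something the question asked for):
find . -type f -printf '%f\n' \
    | awk -F. '!/\./ || $NF == "" { ++count["noext"]; next } { ++count[$NF] }
               END { for (k in count) print count[k] " " k }' \
    | sort -rn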

Count occurence of character in files

I want to count all $ characters in each file in a directory with several subdirectories.
My goal is to count all variables in a PHP project. The files have the suffix .php.
I tried
grep -r '$' . | wc -c
grep -r '$' . | wc -l
and a lot of other stuff, but they all returned a number that cannot be right. In my example file there are only four $.
So I hope someone can help me.
EDIT
My example file
<?php
class MyClass extends Controller {
$a;$a;
$a;$a;
$a;
$a;
To recursively count the number of $ characters in a set of files in a directory you could do:
fgrep -Rho '$' some_dir | wc -l
To include only files of extension .php in the recursion you could instead use:
fgrep -Rho --include='*.php' '$' some_dir | wc -l
The -R is for recursively traversing the files in some_dir, and the -o is for printing only the matched part of each line. The set of files is restricted to the pattern *.php, and file names are not included in the output thanks to -h, which might otherwise have caused false positives.
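If you also want the count broken down per file, one small variant (a sketch, assuming GNU grep) is to keep the file names in the output and let uniq -c do the counting:
fgrep -Ro --include='*.php' '$' some_dir | sort | uniq -c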
For counting variables in a PHP project you can use the variable regex defined here.
So, the following will grep all the variables in each file:
cd ~/my/php/project
grep -Pro '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' .
-P - use perlish regex
-r - recursive
-o - each match on separate line
will produce something like:
./elFinderVolumeLocalFileSystem.class.php:$path
./elFinderVolumeLocalFileSystem.class.php:$path
./elFinderVolumeMySQL.class.php:$driverId
./elFinderVolumeMySQL.class.php:$db
./elFinderVolumeMySQL.class.php:$tbf
You want to count them, so you can use:
$ grep -Proc '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' .
and you will get a per-file count, like the following (note that -c counts matching lines, so a line containing several variables is counted only once):
./connector.minimal.php:9
./connector.php:9
./elFinder.class.php:437
./elFinderConnector.class.php:46
./elFinderVolumeDriver.class.php:1343
./elFinderVolumeFTP.class.php:577
./elFinderVolumeFTPIIS.class.php:63
./elFinderVolumeLocalFileSystem.class.php:279
./elFinderVolumeMySQL.class.php:335
./mime.types:0
./MySQLStorage.sql:0
When you want counts per file and per variable, you can use:
$ grep -Pro '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c
for getting result like:
17 ./elFinderVolumeLocalFileSystem.class.php:$target
8 ./elFinderVolumeLocalFileSystem.class.php:$targetDir
3 ./elFinderVolumeLocalFileSystem.class.php:$test
97 ./elFinderVolumeLocalFileSystem.class.php:$this
1 ./elFinderVolumeLocalFileSystem.class.php:$write
6 ./elFinderVolumeMySQL.class.php:$arc
3 ./elFinderVolumeMySQL.class.php:$bg
10 ./elFinderVolumeMySQL.class.php:$content
1 ./elFinderVolumeMySQL.class.php:$crop
where you can see that the variable $write is used only once, so (maybe) it is useless.
You can also count per variable across the whole project:
$ grep -Proh '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c
and you will get something like:
13 $tree
1 $treeDeep
3 $trg
3 $trgfp
10 $ts
6 $tstat
35 $type
where you can see that $treeDeep is used only once in the whole project, so it is most likely unused.
You can achieve many other combinations with different grep, sort and uniq commands.
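One more such combination, as a sketch: the number of distinct variable names per file (sort -u first removes duplicate file:variable pairs, then the remaining pairs are counted per file):
grep -Pro '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . \
    | sort -u | cut -d: -f1 | sort | uniq -c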
