I want to count all $ characters in each file in a directory with several subdirectories.
My goal is to count all variables in a PHP project. The files have the suffix .php.
I tried
grep -r '$' . | wc -c
grep -r '$' . | wc -l
and a lot of other things, but they all returned a number that cannot be right. My example file contains only four $.
So I hope someone can help me.
EDIT
My example file
<?php
class MyClass extends Controller {
$a;$a;
$a;$a;
$a;
$a;
To recursively count the number of $ characters in a set of files in a directory you could do:
fgrep -Rho '$' some_dir | wc -l
To include only files of extension .php in the recursion you could instead use:
fgrep -Rho --include='*.php' '$' some_dir | wc -l
The -R is for recursively traversing the files in some_dir, and the -o is for printing only the matching part of each line searched. The set of files is restricted to the pattern *.php, and file names are suppressed in the output with -h, since they might otherwise have caused false positives.
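Note that fgrep is an older name for grep -F (fixed-string matching), so if fgrep is not available on your system the same count could be written, for example, as:
grep -RFoh --include='*.php' '$' some_dir | wc -l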
For counting variables in a PHP project you can use the variable regex defined here.
So, the following will grep all variables in each file:
cd ~/my/php/project
grep -Pro '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' .
-P - use Perl-compatible regex
-r - recursive
-o - print each match on a separate line
will produce something like:
./elFinderVolumeLocalFileSystem.class.php:$path
./elFinderVolumeLocalFileSystem.class.php:$path
./elFinderVolumeMySQL.class.php:$driverId
./elFinderVolumeMySQL.class.php:$db
./elFinderVolumeMySQL.class.php:$tbf
You want to count them, so you can use:
$ grep -Proc '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' .
and you will get the count of variables in each file, like:
./connector.minimal.php:9
./connector.php:9
./elFinder.class.php:437
./elFinderConnector.class.php:46
./elFinderVolumeDriver.class.php:1343
./elFinderVolumeFTP.class.php:577
./elFinderVolumeFTPIIS.class.php:63
./elFinderVolumeLocalFileSystem.class.php:279
./elFinderVolumeMySQL.class.php:335
./mime.types:0
./MySQLStorage.sql:0
When you want to count by file and by variable, you can use:
$ grep -Pro '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c
to get a result like:
17 ./elFinderVolumeLocalFileSystem.class.php:$target
8 ./elFinderVolumeLocalFileSystem.class.php:$targetDir
3 ./elFinderVolumeLocalFileSystem.class.php:$test
97 ./elFinderVolumeLocalFileSystem.class.php:$this
1 ./elFinderVolumeLocalFileSystem.class.php:$write
6 ./elFinderVolumeMySQL.class.php:$arc
3 ./elFinderVolumeMySQL.class.php:$bg
10 ./elFinderVolumeMySQL.class.php:$content
1 ./elFinderVolumeMySQL.class.php:$crop
where you can see that the variable $write is used only once, so (maybe) it is useless.
You can also count per variable across the whole project:
$ grep -Proh '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c
and you will get something like:
13 $tree
1 $treeDeep
3 $trg
3 $trgfp
10 $ts
6 $tstat
35 $type
where you can see that $treeDeep is used only once in the whole project, so it is surely useless.
You can achieve many other combinations with different grep, sort and uniq commands.
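For example, building on the commands above, something like this (just a sketch) would list only the variables that occur exactly once in the whole project:
grep -Proh '\$[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*' . | sort | uniq -c | awk '$1 == 1 {print $2}'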
I'm studying bash scripting and I'm stuck fixing an exercise of this site: https://ryanstutorials.net/bash-scripting-tutorial/bash-variables.php#activities
The task is to write a bash script to output a random word from a dictionary whose length is equal to the number supplied as the first command line argument.
My idea was to create a sub-dictionary, assign each word a line number, select a random number from those lines and filter the output, which worked for a similar simpler script, but not for this.
This is the code I used:
6 DIC='/usr/share/dict/words'
7 SUBDIC=$( egrep '^.{'$1'}$' $DIC )
8
9 MAX=$( $SUBDIC | wc -l )
10 RANDRANGE=$((1 + RANDOM % $MAX))
11
12 RWORD=$(nl "$SUBDIC" | grep "\b$RANDRANGE\b" | awk '{print $2}')
13
14 echo "Random generated word from $DIC which is $1 characters long:"
15 echo $RWORD
and this is the error I get when using "21" as input:
bash script.sh 21
script.sh: line 9: counterintelligence's: command not found
script.sh: line 10: 1 + RANDOM % 0: division by 0 (error token is "0")
nl: 'counterintelligence'\''s'$'\n''electroencephalograms'$'\n''electroencephalograph': No such file or directory
Random generated word from /usr/share/dict/words which is 21 characters long:
I tried splitting the code into smaller pieces in bash and got no error (input = 21):
egrep '^.{'21'}$' /usr/share/dict/words | wc -l
3
but once in the script, lines 9 and 10 give errors.
Where do you think the error is?
Problems
SUBDIC=$( egrep '^.{'$1'}$' $DIC ) will store all words of the given length in the SUBDIC variable, so its content is now something like foo bar baz.
MAX=$( $SUBDIC | ... ) will try to run the command foo bar baz which is obviously bogus; it should be more like MAX=$(echo $SUBDIC | ... )
MAX=$( ... | wc -l ) will count the lines; when using the above-mentioned echo $SUBDIC you will have multiple words, but all on one line...
RWORD=$(nl "$SUBDIC" | ...) same problem as above: there's only one line (also note @armali's answer, which points out that nl requires a file or stdin)
RWORD=$(... | grep "\b$RANDRANGE\b" | ...) might also match a dictionary entry that itself contains the number, e.g. catch 22
RWORD=$(... | awk '{print $2}') likely won't handle lines containing spaces
A simple solution
Doing a "random sort" over all the possible words and taking the first line should be sufficient:
egrep "^.{$1}$" "${DIC}" | sort -R | head -1
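Put together as a whole script, that could look roughly like this (a sketch reusing the dictionary path from the question):
#!/bin/bash
DIC='/usr/share/dict/words'
RWORD=$(egrep "^.{$1}$" "$DIC" | sort -R | head -n 1)
echo "Random generated word from $DIC which is $1 characters long:"
echo "$RWORD"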
MAX=$( $SUBDIC | wc -l ) - A pipe is used for connecting a command's output, while $SUBDIC isn't a command; an appropriate syntax is MAX=$( <<<$SUBDIC wc -l ).
nl "$SUBDIC" - The argument to nl has to be a filename, which "$SUBDIC" isn't; an appropriate syntax is nl <<<"$SUBDIC".
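For example, a minimal sketch showing both corrected forms on some throwaway content:
SUBDIC=$'foo\nbar\nbaz'       # three words, one per line
MAX=$( <<<"$SUBDIC" wc -l )   # MAX is now 3
nl <<<"$SUBDIC"               # numbers the three lines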
This code will do it. My test dictionary of words is in the file file. It's a good idea to get all words of a given length first, but put them in an array, not in a plain variable, and then pick a random index and echo it.
dic=( $(sed -n "/^.\{$1\}$/p" file) )
ind=$((0 + RANDOM % ${#dic[@]}))
echo ${dic[$ind]}
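A usage sketch, assuming the script above is saved as randword.sh and file points at /usr/share/dict/words:
./randword.sh 21
# prints one of the three 21-character words, e.g. electroencephalograms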
I am also doing this activity and I came up with a simple solution.
I created this script:
#!/bin/bash
awk "NR==$1 {print}" /usr/share/dict/words
If you want a random word, run the script from the terminal like this:
./script.sh $RANDOM
If you want to print the word on a specific line number, run it like this:
./script.sh 465
cat /usr/share/dict/american-english | head -n $RANDOM | tail -n 1
$RANDOM - returns a different random number each time it is referred to.
This simple line outputs a random word from the mentioned dictionary.
Alternatively, as umläute mentioned, you can do:
cat /usr/share/dict/american-english | sort -R | head -1
I need to go through a huge number of text files and list the ones that contain ALL of the words listed in another text file.
I need to list only the files containing all of the words. It does not have to be in any specific order. I've tried to use a variety of grep commands, but it only outputs the files containing any of the words, not all of them. It would be ideal to use the txt file containing the list of words as a search for grep.
Expected output is a list of just the files that succeed in the search (files that contains all the words from the "query" text file)
Tried
grep -Ffw word_list.txt /*.fas
find . -exec grep "word_list.txt" '{}' \; -print
I've found solutions using a number of pipes like
awk "/word1/&&/word2/&&/word3/" ./*.txt
find . -path '*.txt' -prune -o -type f -exec gawk '/word1/{a=1}/word2/{b=1}/word3/{c=1}END{ if (a && b && c) print FILENAME }' {} \;
But I have a huge list of words, so this would be impractical.
Thank you.
Given sample files
file1.txt
word1
word2
word4
word5
file2.txt
word1
word2
word3
word4
file3.txt
word2
word3
word4
file4.txt
word0
word1
word2
word3
word4
file5.txt
word0
word1
word2
word3
word4
word5
This old-fashioned awk/shell code will do it:
#!/bin/bash
wordList="$1"
shift
awk -v wdListFile="$wordList" '
BEGIN{
dbg=0
while(getline < wdListFile > 0 ) {
words[$0]=$0
flags[$0]=0
numFlags++
}
}
{
if (dbg) { print "#dbg: myFile=" myFile " FILENAME=" FILENAME }
if (myFile != FILENAME) {
# a minor cost of extra reset on the first iteration in the run
if (dbg) { print "#dbg: inside flags reset" }
for (flg in flags) {
flags[flg]=0
}
}
for (i=1; i<=NF; i++) {
if (dbg) { print "#dbg: $i="$i }
if ($i in words) {
flags[$i]++
}
}
matchedCnt=0
for (f in flags) {
if (dbg) { print "#dbg: flags["f"]="flags[f] }
if (flags[f] > 0 ) {
matchedCnt++
if (dbg) { print "#dbg: incremented matchedCnt to " matchedCnt}
}
}
if (dbg) {print "#dbg: Testing matchedCnt=" matchedCnt "==numFlags=" numFlags}
if (matchedCnt == numFlags) {
if (dbg) { print "All words found in " FILENAME " matchedCnt=" matchedCnt " numFlags=" numFlags }
print FILENAME
nextfile
}
myFile=FILENAME
if (dbg) { print "#dbg: myFile NOW=" myFile }
}' "$@"
Run from the command line as
./genGrep.sh wd.lst file*.txt
Produces the following output
file2.txt
file4.txt
file5.txt
One time only, make the script executable with
chmod 755 ./genGrep.sh
I would recommend making a copy of this file with dbg in the name, then take the original copy and delete all lines with dbg. This way you'll have a dbg version if you need it, but the dbg lines add an extra ~20% to reading the code.
Note that you can switch all debugging on by setting dbg=1, or you can turn on individual lines by adding a ! character, i.e. if (! dbg) { ... }.
If for some reason you're running on really old Unix hardware, the nextfile command may not work. See if your system has gawk available, or get it installed.
I think there is a trick to getting nextfile behavior if it's not built in, but I don't want to spend time researching that now.
Note that the use of the flags[] array, matchedCnt variable and the builtin awk function nextfile is designed to stop searching in a file once all words have been found.
You could also add a parameter to say "if n percent match, then print file name", but that comes with a consulting rate attached.
If you don't understand the stripped down awk code (removing the dbg sections), please work your way through Grymoire's Awk Tutorial before asking questions.
Managing thousands of files (as you indicate) is a separate problem. But to get things going, I would call genGrep.sh wd.lst A* ; genGrep.sh wd.lst B*; ... and hope that works. The problem is that the command line has a limit on the number of characters it can process at once in filename lists. So if A* expands to 1 billion characters, you have to find a way to break the list up into something that the shell can process.
Typically, this is solved with xargs, so
find /path/to/files -name 'file*.txt' | xargs ./genGrep.sh wd.lst
This will find all the files that you specify by wildcard, under one or more /path/to/files directories that you list as the first argument to find.
All matching files are sent through the pipe to xargs, which reads as many files from the list as one command invocation can process, and continues looping (not visible to you) until all files have been processed.
There are extra options to xargs that allow having multiple copies of ./genGrep.sh running, if you have the extra "cores" available on your computer. I don't want to get to deep into that, as I don't know if the rest of this is really going to work in your real-world use.
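For reference, a minimal sketch of that (assuming GNU xargs; -n limits how many file names go into each invocation and -P runs that many invocations in parallel):
find /path/to/files -name 'file*.txt' | xargs -n 100 -P 4 ./genGrep.sh wd.lst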
IHTH
It's a little hack, as there is no direct way to do an AND across a whole file in grep. We can chain grep commands so that each step keeps only the files that survived the previous one:
grep -l -E "word1" *.txt | xargs grep -l -E "word2" | xargs grep -l -E "word3" | xargs grep -l -E "word4"
-l => --files-with-matches, print only the names of the matching files, which xargs then passes on to the next grep
-E => --extended-regexp
The output of the last grep is the list of files that contain all four words.
Try something like:
WORD_LIST=file_with_words.txt
FILES_LIST=file_with_files_to_search.txt
RESULT=file_with_files_containing_all_words.txt
# Generate a list of files to search and store as provisional result
# You can use find, ls, or any other way you find useful
find . -type f > ${RESULT}
# Now perform the search for every word
for WORD in $(<${WORD_LIST}); do
# Remove any previous file list
rm -f ${FILES_LIST}
# Set the provisional result as the new starting point
mv ${RESULT} ${FILES_LIST}
# Do a grep on this file list and keep only the files that
# contain this particular word (and all the previous ones)
cat ${FILES_LIST} | xargs grep -l "${WORD}" > ${RESULT}
done
# Clean up temporary files
rm -f ${FILES_LIST}
At this point you should have in ${RESULT} the list of files that contain all the words in ${WORD_LIST}.
This operation is costly, as you have to read all the (still) candidate files again and again for every word you check, so try to put the least frequent words first in ${WORD_LIST}; that way you drop as many files as possible from the check as early as possible.
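If you don't know in advance which words are rare, one way to order the list (a rough sketch, reusing the variables from the script above) is to count how many files each word appears in and sort ascending before running the loop:
while read -r WORD; do
  printf '%s %s\n' "$(grep -rl "${WORD}" . | wc -l)" "${WORD}"
done < "${WORD_LIST}" | sort -n | cut -d' ' -f2- > sorted_words.txt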
I am new to Linux and currently I am facing a problem. I want to get a list of the extensions (.doc, .pdf) used in a folder. I googled a lot and finally found a solution, which is given below:
find . -type f | awk -F. '!a[$NF]++{print $NF}'
I understand find . -type f, but I am unable to understand awk -F. '!a[$NF]++{print $NF}'. What does it mean?
NF = Number of Fields in the current record
Can anyone explain?
Thanks in advance.
To answer your question about what the awk line is doing:
As you already indicated, the command find . -type f returns a list of the files located in the current directory and its subdirectories, e.g.
./foo.ext1
./bar.ext2
./spam.ext2
./ham.ext3
./spam.ham.eggs
This list of files is sent through a pipe to the command awk -F. '!a[$NF]++{print $NF}'. This awk line contains a lot of information. First of all, you need to know that awk is a record parser where each record consists of a number of fields. The default record is a line, while the default field separator is whitespace. So what does your awk line do?
-F. :: this redefines the field separator to be a dot (.). Each record is now split on its dots, so the first line of the example splits into an empty field (before the leading dot), /foo and ext1, while the last line splits into an empty field, /spam, ham and eggs.
NF :: this is an awk variable that returns the number of fields per record. It is clear that the extension is represented by the last field ($NF)
a[$NF] :: this is an array where the index is the extension. The default array value is zero unless you assign something to it.
a[$NF]++ :: this returns the current value of a[$NF] and increments the value by 1 after the return. Thus for line 1, a["ext1"]++ returns 0 and sets a["ext1"] to 1, while for line 3, a["ext2"]++ returns 1 and sets a["ext2"] to 2. This indicates that a[$NF] keeps track of the number of times $NF has appeared.
!a[$NF]++ :: this combines the logic of the above but checks whether the return value of a[$NF]++ is 0. If it is 0, it returns true, otherwise false. In the case of line 2 of the example, this statement will return true because a["ext2"]++ has the value 0. However, after the statement a["ext2"] has the value 1. When reading line 3, the statement will return false. In other words: have we seen $NF already? And while answering this question with "yes" or "no", increment the count of $NF by one.
!a[$NF]++{print $NF} :: this combines everything. It essentially states: if !a[$NF]++ returns true, then print $NF, but before printing increment a[$NF] by one. Or in other words: if the field representing the extension ($NF) appears for the first time, print that field; if it has already appeared before, do nothing.
The incrementing of the array is important, as it keeps track of what has already been seen. So, line by line, the following happens:
foo.ext1 => $NF="ext1", a["ext1"] is 0 so print $NF and set a["ext1"]=1
bar.ext2 => $NF="ext2", a["ext2"] is 0 so print $NF and set a["ext2"]=1
spam.ext2 => $NF="ext2", a["ext2"] is 1 so do not print and set a["ext2"]=2
ham.ext3 => $NF="ext3", a["ext3"] is 0 so print $NF and set a["ext3"]=1
spam.ham.eggs => $NF="eggs", a["eggs"] is 0 so print $NF and set a["eggs"]=1
The output is
ext1
ext2
ext3
eggs
General comments:
A file without any extension, whether located in a hidden directory or not (e.g. ./path/to/awesome_filename_without_extension or ./path/to/.secret/filename_without_extension), will have part of its full path printed as if it were the extension. The result, however, is meaningless, e.g.
/path/to/awesome_filename_without_extension
secret/filename_without_extension
This is best resolved as
find . -type f -exec basename -a '{}' + \
| awk -F. '((NF>1)&&(!a[$NF]++)){print $NF}'
Here the output of find is processed directly by basename, which strips the directory from the filename. The awk line does one more check: do we have more than 1 field (i.e. is there an extension)?
A very simple way of doing what you are attempting is to sort the output keeping only unique extensions, e.g.
find . -type f -regex ".*[.][a-zA-Z0-9][a-zA-Z0-9]*$" | \
awk -F '.' '{ print $NF }' | sort -u
if your sort doesn't support the -u option, then you can pipe the results of sort to uniq, e.g.
find . -type f -regex ".*[.][a-zA-Z0-9][a-zA-Z0-9]*$" | \
awk -F '.' '{ print $NF }' | sort | uniq
The -regex option limits the find selection to filenames with an extension of at least one ASCII alphanumeric character. However, it will also pick up the last dot-separated part of names that contain more than one '.', e.g. foo.bar.fatcat would result in fatcat being included in the list.
You could adjust the regular expression to meet your needs. If your version of find supports posix-extended regular expressions then you can prevent longer extensions from being picked up. For example to limit the extension to 1-3 characters, you could use:
find . -type f -regextype posix-extended -regex ".*[.][a-zA-Z0-9]{1,3}$" | \
awk -F '.' '{ print $NF }' | sort -u
There are other ways to approach this, but given your initial example, this is a close follow-on.
You can use the following command for this purpose:
$ find <DIR> -type f -print0 | xargs -0 -n1 basename | grep -Po '(?<=.)\..*$' | sort | uniq
.bak
.c
.file
.file.bak
.input
.input.bak
.log
.log.bak
.out
.out.bak
.test
.test.bak
.txt
.txt.bak
where the find command looks for all files under the <DIR> subtree and passes them to basename to get only the filename without the path part (-print0 and -0 are used to handle files with spaces in their names); then grep keeps only the part of the string starting at the first . (the extension, e.g. .tar, .txt, .tar.gz), while ignoring hidden files whose names start with a dot. After that you sort the extensions and keep only the unique values.
If you do not need the leading . in the extension name, add:
| sed 's/^\.//'
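So the whole pipeline, with the leading dot stripped, would look like:
$ find <DIR> -type f -print0 | xargs -0 -n1 basename | grep -Po '(?<=.)\..*$' | sed 's/^\.//' | sort | uniq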
I've got a script in crontab which creates, every 30 minutes, a file with the list of offline peers in Asterisk:
now=$(date +"%Y%m%d%H%M")
/usr/sbin/asterisk -rx 'sip show peers' | grep "Unspec" | sed 's/[/].*//' >> /var/log/asterisk/offline/offline_$now
I need to parse these files and find the extensions that were always offline, i.e. strings that occur in every file.
How can I do this?
Output is:
/usr/sbin/asterisk -rx 'sip show peers' | grep "Unspec" | sed 's/[/].*//' | tail -3
891
894
899
ls /var/log/asterisk/offline/
offline_201309051400 offline_201309051418 offline_201309051530 offline_201309051700
offline_201309051830 offline_201309052000 offline_201309052130
offline_201309051405 offline_201309051430 offline_201309051600 offline_201309051730
offline_201309051900 offline_201309052030 offline_201309052200
offline_201309051406 offline_201309051500 offline_201309051630 offline_201309051800
offline_201309051930 offline_201309052100 offline_201309052230
This awk script will print the lines that are present in all of the files:
awk 'FNR==1{f++}{a[$0]++}END{for (i in a) if (a[i]==f) print i}' offline_*
How it works:
With FNR==1{f++} we count the number of files that are parsed (FNR is equal to one for the first line of each file)
with {a[$0]++} we count how many times each line has appeared.
the END block prints the elements of the array that have been found f times.
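Applied to the files from the question, that would be something like (a usage sketch; the trailing sort -n just orders the numeric extensions):
awk 'FNR==1{f++}{a[$0]++}END{for (i in a) if (a[i]==f) print i}' /var/log/asterisk/offline/offline_* | sort -n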
I have a txt file where each line starts with the participant number, followed by the date and other variables (numbers only), so it has this format:
S001_2 20090926 14756 93
S002_2 20090803 15876 13
I want to write a script that creates smaller txt files containing only 20 participants per file (so the first one will contain lines from S001_2 to S020_2, the second from S021_2 to S040_2; the total number of subjects is approximately 200). However, the subjects are not in order, therefore I can't set a range with sed.
What would be the best command to split the participants into chunks depending on which number (S001_2) the line starts with?
Thanks in advance.
Use the split command to split a file (or a filtered result) without ranges and sed. According to the documentation, this should work:
cat file.txt | split -l 20 - PREFIX
This will produce the files PREFIXaa, PREFIXab, ... (Note that it does not add the .txt extension to the file name!)
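If your split is a reasonably recent GNU version, you can also put the extension back with --additional-suffix (not available in older or non-GNU versions):
split -l 20 --additional-suffix=.txt file.txt PREFIX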
If you want to sort the file first, in the way @Sergey described:
cat file.txt | sort | split -l 20 - PREFIX
Sort without any parameters should be suitable, because there are leading zeros in your numbers like S001_2. So, first sort the file:
sort file.txt > sorted.txt
Then you will be able to set ranges with sed on sorted.txt.
This is a whole script for splitting the sorted file into 20-line files:
num=1;
i=1;
lines=`wc -l < sorted.txt`; # get number of lines
while [ $i -le $lines ]; do
sed -n $i,`echo $i+19 | bc`p sorted.txt > file$num;
num=`echo $num+1 | bc`;
i=`echo $i+20 | bc`;
done;
$ split -d -a 3 -l 20 file.txt db_
produces: db_000, db_001, db_002, ..., db_N